-
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Authors:
Ashwinee Panda,
Berivan Isik,
Xiangyu Qi,
Sanmi Koyejo,
Tsachy Weissman,
Prateek Mittal
Abstract:
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.
Submitted 25 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Lossy Compression for Schrödinger-style Quantum Simulations
Authors:
Noah Huffman,
Dmitri Pavlichin,
Tsachy Weissman
Abstract:
Simulating quantum circuits on classical hardware is a powerful and necessary tool for developing and testing quantum algorithms and hardware as well as evaluating claims of quantum supremacy in the Noisy Intermediate-Scale Quantum (NISQ) regime. Schrödinger-style simulations are limited by the exponential growth of the number of state amplitudes which need to be stored. In this work, we apply scalar and vector quantization to Schrödinger-style quantum circuit simulations as lossy compression schemes to reduce the number of bits needed to simulate quantum circuits. Using quantization, we can maintain simulation fidelities $>0.99$ when simulating the Quantum Fourier Transform, while using only 7 significand bits in a floating-point number to characterize the real and imaginary components of each amplitude. Furthermore, using vector quantization, we propose a method to bound the number of bits/amplitude needed to store state vectors in a simulation of a circuit that achieves a desired fidelity, and show that for a 6 qubit simulation of the Quantum Fourier Transform, 15 bits/amplitude is sufficient to maintain fidelity $>0.9$ at $10^4$ depth.
Submitted 1 March, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
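The scalar-quantization idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes float32 storage, treats "7 significand bits" as the top 7 stored mantissa bits, truncates rather than rounds, and quantizes a single random state once instead of simulating a Quantum Fourier Transform circuit. The function names are hypothetical.

```python
import numpy as np

def truncate_significand(values, keep_bits):
    """Zero all but the top `keep_bits` stored mantissa bits of float32 values."""
    as_f32 = np.ascontiguousarray(values, dtype=np.float32)
    drop = 23 - keep_bits  # float32 stores 23 explicit mantissa bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (as_f32.view(np.uint32) & mask).view(np.float32)

def quantize_state(state, keep_bits):
    """Quantize the real and imaginary parts of each amplitude separately."""
    real = truncate_significand(state.real, keep_bits)
    imag = truncate_significand(state.imag, keep_bits)
    return real.astype(np.float64) + 1j * imag.astype(np.float64)

def fidelity(psi, phi):
    """Squared overlap |<psi|phi>|^2 of two state vectors after normalization."""
    psi = psi / np.linalg.norm(psi)
    phi = phi / np.linalg.norm(phi)
    return float(np.abs(np.vdot(psi, phi)) ** 2)

# Assumed test input: a random 6-qubit state (64 complex amplitudes).
rng = np.random.default_rng(0)
n_qubits = 6
state = rng.normal(size=2**n_qubits) + 1j * rng.normal(size=2**n_qubits)
state /= np.linalg.norm(state)

quantized = quantize_state(state, keep_bits=7)
print(fidelity(state, quantized))
```

A single truncation only bounds the per-step error. In a deep circuit the error accumulates gate by gate, which is the regime the fidelity-versus-depth results in the abstract address.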
-
Adaptive Compression in Federated Learning via Side Information
Authors:
Berivan Isik,
Francesco Pase,
Deniz Gunduz,
Sanmi Koyejo,
Tsachy Weissman,
Michele Zorzi
Abstract:
The high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods -- in which the client $n$ sends a sample from a client-only probability distribution $q_{φ^{(n)}}$, and the server estimates the mean of the clients' distributions using these samples. However, such methods do not take full advantage of the FL setup where the server, throughout the training process, has side information in the form of a global distribution $p_θ$ that is close to the clients' distribution $q_{φ^{(n)}}$ in Kullback-Leibler (KL) divergence. In this work, we exploit this closeness between the clients' distributions $q_{φ^{(n)}}$'s and the side information $p_θ$ at the server, and propose a framework that requires approximately $D_{KL}(q_{φ^{(n)}}|| p_θ)$ bits of communication. We show that our method can be integrated into many existing stochastic compression frameworks to attain the same (and often higher) test accuracy with up to $82$ times smaller bitrate than the prior work -- corresponding to 2,650 times overall compression.
Submitted 21 April, 2024; v1 submitted 21 June, 2023;
originally announced June 2023.
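To make the $D_{KL}(q_{φ^{(n)}}|| p_θ)$ communication cost from the abstract above concrete, the sketch below evaluates it in bits for one assumed case: both distributions are products of independent Bernoulli variables, one per parameter. That distribution family, the function name, and the numbers are illustrative assumptions; the coding scheme that actually achieves this rate is not shown.

```python
import numpy as np

def bernoulli_kl_bits(q, p, eps=1e-12):
    """KL divergence D(q || p) in bits between products of Bernoulli distributions."""
    q = np.clip(np.asarray(q, dtype=np.float64), eps, 1 - eps)
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1 - eps)
    kl = q * np.log2(q / p) + (1 - q) * np.log2((1 - q) / (1 - p))
    return float(kl.sum())

# Hypothetical example: the server's side information p is close to the client's q.
rng = np.random.default_rng(0)
p_global = rng.uniform(0.2, 0.8, size=10_000)
q_client = np.clip(p_global + rng.normal(scale=0.02, size=p_global.size), 0.01, 0.99)

kl_bits = bernoulli_kl_bits(q_client, p_global)
print(f"KL cost: {kl_bits:.1f} bits for {p_global.size} parameters")
```

When the two distributions are close, this total is far below the one bit per parameter that sending a raw binary sample would cost, which is the gap the framework exploits.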
-
Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation
Authors:
Berivan Isik,
Wei-Ning Chen,
Ayfer Ozgur,
Tsachy Weissman,
Albert No
Abstract:
We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optimal approach in the presence of shared randomness (a random variable shared between the server and the user) and identify several conditions for \emph{exact} optimality. We prove that one of the conditions is to utilize a rotationally symmetric shared random codebook. Based on this, we propose a randomization mechanism where the codebook is a randomly rotated simplex -- satisfying the properties of the \emph{exact}-optimal codebook. The proposed mechanism is based on a $k$-closest encoding which we prove to be \emph{exact}-optimal for the randomly rotated simplex codebook.
Submitted 28 October, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Toward Textual Transform Coding
Authors:
Tsachy Weissman
Abstract:
Inspired by recent work on compression with and for young humans, the success of transform-based approaches to information processing, and the rise of powerful language-based AI, we propose \emph{textual transform coding}. It shares some of its key properties with traditional transform-based coding underlying much of our current multimedia compression technologies. It can form the basis for compression at bit rates until recently considered uselessly low, and for boosting human satisfaction from reconstructions at more traditional bit rates.
Submitted 2 May, 2023;
originally announced May 2023.
-
PIM: Video Coding using Perceptual Importance Maps
Authors:
Evgenya Pergament,
Pulkit Tandon,
Oren Rippel,
Lubomir Bourdev,
Alexander G. Anderson,
Bruno Olshausen,
Tsachy Weissman,
Sachin Katti,
Kedar Tatwawadi
Abstract:
Human perception is at the core of lossy video compression, with numerous approaches developed for perceptual quality assessment and improvement over the past two decades. In the determination of perceptual quality, different spatio-temporal regions of the video differ in their relative importance to the human viewer. However, since it is challenging to infer or even collect such fine-grained information, it is often not used during compression beyond low-level heuristics. We present a framework which facilitates research into fine-grained subjective importance in compressed videos, which we then utilize to improve the rate-distortion performance of an existing video codec (x264). The contributions of this work are threefold: (1) we introduce a web-tool which allows scalable collection of fine-grained perceptual importance, by having users interactively paint spatio-temporal maps over encoded videos; (2) we use this tool to collect a dataset of 178 videos with a total of 14443 frames of human annotated spatio-temporal importance maps over the videos; and (3) we use our curated dataset to train a lightweight machine learning model which can predict these spatio-temporal importance regions. We demonstrate via a subjective study that encoding the videos in our dataset while taking into account the importance maps leads to higher perceptual quality at the same bitrate, with the videos encoded with importance maps preferred $1.8 \times$ over the baseline videos. Similarly, we show that for the 18 videos in the test set, the importance maps predicted by our model lead to higher perceptual quality videos, $2 \times$ preferred over the baseline at the same bitrate.
Submitted 9 April, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions
Authors:
Wei Zhang,
Yanjun Han,
Zhengyuan Zhou,
Aaron Flores,
Tsachy Weissman
Abstract:
With the advent and increasing consolidation of e-commerce, digital advertising has very recently replaced traditional advertising as the main marketing force in the economy. In the past four years, a particularly important development in the digital advertising industry is the shift from second-price auctions to first-price auctions for online display ads. This shift immediately motivated the intellectually challenging question of how to bid in first-price auctions, because unlike in second-price auctions, bidding one's private value truthfully is no longer optimal. Following a series of recent works in this area, we consider a differentiated setup: we do not make any assumption about other bidders' maximum bid (i.e. it can be adversarial over time), and instead assume that we have access to a hint that serves as a prediction of other bidders' maximum bid, where the prediction is learned through some blackbox machine learning model. We consider two types of hints: one where a single point-prediction is available, and the other where a hint interval (representing a type of confidence region into which others' maximum bid falls) is available. We establish minimax optimal regret bounds for both cases and highlight the quantitatively different behavior between the two settings. We also provide improved regret bounds when the others' maximum bid exhibits the further structure of sparsity. Finally, we complement the theoretical results with demonstrations using real bidding data.
Submitted 5 November, 2022;
originally announced November 2022.
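The claim in the abstract above that truthful bidding stops being optimal in first-price auctions follows directly from the payoff functions. The sketch below is illustrative only: the function names are hypothetical, ties are assumed to go to the bidder, and it does not implement the paper's hint-based algorithm.

```python
def first_price_payoff(value, bid, others_max):
    """Winner pays their own bid; ties are assumed to go to the bidder."""
    return value - bid if bid >= others_max else 0.0

def second_price_payoff(value, bid, others_max):
    """Winner pays the highest competing bid; ties are assumed to go to the bidder."""
    return value - others_max if bid >= others_max else 0.0

value, others_max = 1.0, 0.5

# Truthful bidding wins in both formats but earns nothing in the first-price one.
print(first_price_payoff(value, bid=value, others_max=others_max))
print(second_price_payoff(value, bid=value, others_max=others_max))

# Shading the bid down to the competing maximum recovers the surplus.
print(first_price_payoff(value, bid=others_max, others_max=others_max))
```

In the first-price format the best bid depends on the unknown `others_max`, which is why a prediction (hint) of that quantity is useful to the bidder.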
-
Upper bounds on the Rate of Uniformly-Random Codes for the Deletion Channel
Authors:
Berivan Isik,
Francisco Pernice,
Tsachy Weissman
Abstract:
We consider the maximum coding rate achievable by uniformly-random codes for the deletion channel. We prove an upper bound that's within 0.1 of the best known lower bounds for all values of the deletion probability $d,$ and much closer for small and large $d.$ We give simulation results which suggest that our upper bound is within 0.05 of the exact value for all $d$, and within $0.01$ for $d>0.75$. Despite our upper bounds, based on simulations, we conjecture that a positive rate is achievable with uniformly-random codes for all deletion probabilities less than 1. Our results imply impossibility results for the (equivalent) problem of compression of i.i.d. sources correlated via the deletion channel, a relevant model for DNA storage.
Submitted 13 October, 2022;
originally announced October 2022.
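For readers unfamiliar with the channel model in the abstract above, a deletion channel drops each input symbol independently with probability $d$ and gives the receiver no indication of where deletions occurred. The following is a minimal simulation sketch; the function name and parameters are hypothetical, and it does not reproduce the paper's bounds or simulations.

```python
import numpy as np

def deletion_channel(symbols, d, rng):
    """Delete each symbol independently with probability d; survivors keep their order."""
    symbols = np.asarray(symbols)
    keep = rng.random(symbols.shape[0]) >= d
    return symbols[keep]

rng = np.random.default_rng(0)
codeword = rng.integers(0, 2, size=20)  # a uniformly random binary codeword
received = deletion_channel(codeword, d=0.3, rng=rng)
print(len(codeword), len(received))
```

The receiver sees only the shortened sequence, so decoding must cope with lost synchronization rather than flipped symbols.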
-
Sparse Random Networks for Communication-Efficient Federated Learning
Authors:
Berivan Isik,
Francesco Pase,
Deniz Gunduz,
Tsachy Weissman,
Michele Zorzi
Abstract:
One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
Submitted 8 February, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
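The communication pattern described in the abstract above (frozen random weights, a stochastic binary mask, one bit per parameter uploaded) can be sketched as a single round. This is an assumption-heavy illustration: `local_update` is a hypothetical stand-in that perturbs the mask probabilities at random instead of training them, and the server rule shown (averaging the sampled masks) is one plausible aggregation, not necessarily the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_clients = 1000, 8

weights = rng.normal(size=n_params)  # frozen random weights, never updated
theta = np.full(n_params, 0.5)       # global per-parameter mask probabilities

def local_update(probs, rng, step=0.1):
    """Hypothetical stand-in for local training: a random nudge, not a gradient step."""
    return np.clip(probs + step * rng.normal(size=probs.shape), 0.0, 1.0)

def sample_mask(probs, rng):
    """Draw a binary mask with independent Bernoulli(probs) entries."""
    return (rng.random(probs.shape) < probs).astype(np.uint8)

# Each client uploads only a sampled binary mask: 1 bit per parameter before any coding.
client_masks = np.stack(
    [sample_mask(local_update(theta, rng), rng) for _ in range(n_clients)]
)

# The server estimates the new mask probabilities as the mean of the received masks.
theta_new = client_masks.mean(axis=0)

# The final model is the frozen random network with a sampled mask applied.
final_mask = sample_mask(theta_new, rng)
sparse_weights = weights * final_mask
```

Entropy coding the masks is what would push the uplink cost below one bit per parameter, as the abstract reports; that step is omitted here.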
-
An Interactive Annotation Tool for Perceptual Video Compression
Authors:
Evgenya Pergament,
Pulkit Tandon,
Kedar Tatwawadi,
Oren Rippel,
Lubomir Bourdev,
Bruno Olshausen,
Tsachy Weissman,
Sachin Katti,
Alexander G. Anderson
Abstract:
Human perception is at the core of lossy video compression and yet, it is challenging to collect data that is sufficiently dense to drive compression. In perceptual quality assessment, human feedback is typically collected as a single scalar quality score indicating preference of one distorted video over another. In reality, some videos may be better in some parts but not in others. We propose an approach to collecting finer-grained feedback by asking users to use an interactive tool to directly optimize for perceptual quality given a fixed bitrate. To this end, we built a novel web-tool which allows users to paint these spatio-temporal importance maps over videos. The tool allows for interactive successive refinement: we iteratively re-encode the original video according to the painted importance maps, while maintaining the same bitrate, thus allowing the user to visually see the trade-off of assigning higher importance to one spatio-temporal part of the video at the cost of others. We use this tool to collect data in-the-wild (10 videos, 17 users) and utilize the obtained importance maps in the context of x264 coding to demonstrate that the tool can indeed be used to generate videos which, at the same bitrate, look perceptually better through a subjective study - and are 1.9 times more likely to be preferred by viewers. The code for the tool and dataset can be found at https://github.com/jenyap/video-annotation-tool.git
Submitted 8 May, 2022;
originally announced May 2022.
-
Lossy Compression of Noisy Data for Private and Data-Efficient Learning
Authors:
Berivan Isik,
Tsachy Weissman
Abstract:
Storage-efficient privacy-preserving learning is crucial due to increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) to the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
Submitted 22 March, 2023; v1 submitted 6 February, 2022;
originally announced February 2022.
-
Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text
Authors:
Pulkit Tandon,
Shubham Chandak,
Pat Pataranutaporn,
Yimeng Liu,
Anesu M. Mapuranga,
Pattie Maes,
Tsachy Weissman,
Misha Sra
Abstract:
Video represents the majority of internet traffic today, driving a continual race between the generation of higher quality content, transmission of larger file sizes, and the development of network infrastructure. In addition, the recent COVID-19 pandemic fueled a surge in the use of video conferencing tools. Since videos take up considerable bandwidth (~100 Kbps to a few Mbps), improved video compression can have a substantial impact on network performance for live and pre-recorded content, providing broader access to multimedia content worldwide. We present a novel video compression pipeline, called Txt2Vid, which dramatically reduces data transmission rates by compressing webcam videos ("talking-head videos") to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep learning based voice cloning and lip syncing models. Our generative pipeline achieves two to three orders of magnitude reduction in the bitrate as compared to the standard audio-video codecs (encoders-decoders), while maintaining equivalent Quality-of-Experience based on a subjective evaluation by users (n = 242) in an online study. The Txt2Vid framework opens up the potential for creating novel applications such as enabling audio-video communication during poor internet connectivity, or in remote terrains with limited bandwidth. The code for this work is available at https://github.com/tpulkit/txt2vid.git.
Submitted 2 April, 2022; v1 submitted 26 June, 2021;
originally announced June 2021.
-
An Information-Theoretic Justification for Model Pruning
Authors:
Berivan Isik,
Tsachy Weissman,
Albert No
Abstract:
We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and derive the tradeoff between rate (compression) and distortion. In addition to characterizing theoretical limits of NN compression, this formulation shows that \emph{pruning}, implicitly or explicitly, must be a part of a good compression algorithm. This observation bridges a gap between parts of the literature pertaining to NN and data compression, respectively, providing insight into the empirical success of model pruning. Finally, we propose a novel pruning strategy derived from our information-theoretic formulation and show that it outperforms the relevant baselines on CIFAR-10 and ImageNet datasets.
Submitted 9 February, 2022; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Neural Network Compression for Noisy Storage Devices
Authors:
Berivan Isik,
Kristy Choi,
Xin Zheng,
Tsachy Weissman,
Stefano Ermon,
H. -S. Philip Wong,
Armin Alaghi
Abstract:
Compression and efficient storage of neural network (NN) parameters is critical for applications that run on resource-constrained devices. Despite the significant progress in NN model compression, there has been considerably less investigation in the actual \textit{physical} storage of NN parameters. Conventionally, model compression and physical storage are decoupled, as digital storage media with error-correcting codes (ECCs) provide robust error-free storage. However, this decoupled approach is inefficient as it ignores the overparameterization present in most NNs and forces the memory device to allocate the same amount of resources to every bit of information regardless of its importance. In this work, we investigate analog memory devices as an alternative to digital media -- one that naturally provides a way to add more protection for significant bits unlike its counterpart, but is noisy and may compromise the stored model's performance if used naively. We develop a variety of robust coding strategies for NN weight storage on analog devices, and propose an approach to jointly optimize model compression and memory resource allocation. We then demonstrate the efficacy of our approach on models trained on MNIST, CIFAR-10 and ImageNet datasets for existing compression techniques. Compared to conventional error-free digital storage, our method reduces the memory footprint by up to one order of magnitude, without significantly compromising the stored model's accuracy.
Submitted 13 March, 2023; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry
Authors:
Roshan Prabhakar,
Shubham Chandak,
Carina Chiu,
Renee Liang,
Huong Nguyen,
Kedar Tatwawadi,
Tsachy Weissman
Abstract:
COVID-19 has made video communication one of the most important modes of information exchange. While extensive research has been conducted on the optimization of the video streaming pipeline, in particular the development of novel video codecs, further improvement in the video quality and latency is required, especially under poor network conditions. This paper proposes an alternative to the conventional codec through the implementation of a keypoint-centric encoder relying on the transmission of keypoint information from within a video feed. The decoder uses the streamed keypoints to generate a reconstruction preserving the semantic features in the input feed. Focusing on video calling applications, we detect and transmit the body pose and face mesh information through the network, which are displayed at the receiver in the form of animated puppets. Using efficient pose and face mesh detection in conjunction with skeleton-based animation, we demonstrate a prototype requiring lower than 35 kbps bandwidth, an order of magnitude reduction over typical video calling systems. The added computational latency due to the mesh extraction and animation is below 120ms on a standard laptop, showcasing the potential of this framework for real-time applications. The code for this work is available at https://github.com/shubhamchandak94/digital-puppetry/.
Submitted 8 January, 2021; v1 submitted 7 November, 2020;
originally announced November 2020.
-
Learning to Bid Optimally and Efficiently in Adversarial First-price Auctions
Authors:
Yanjun Han,
Zhengyuan Zhou,
Aaron Flores,
Erik Ordentlich,
Tsachy Weissman
Abstract:
First-price auctions have very recently swept the online advertising industry, replacing second-price auctions as the predominant auction mechanism on many platforms. This shift has brought forth important challenges for a bidder: how should one bid in a first-price auction, where unlike in second-price auctions, it is no longer optimal to bid one's private value truthfully and hard to know the others' bidding behaviors? In this paper, we take an online learning angle and address the fundamental problem of learning to bid in repeated first-price auctions, where both the bidder's private valuations and other bidders' bids can be arbitrary. We develop the first minimax optimal online bidding algorithm that achieves an $\widetilde{O}(\sqrt{T})$ regret when competing with the set of all Lipschitz bidding policies, a strong oracle that contains a rich set of bidding strategies. This novel algorithm is built on the insight that the presence of a good expert can be leveraged to improve performance, as well as an original hierarchical expert-chaining structure, both of which could be of independent interest in online learning. Further, by exploiting the product structure that exists in the problem, we modify this algorithm--in its vanilla form statistically optimal but computationally infeasible--to a computationally efficient and space efficient algorithm that also retains the same $\widetilde{O}(\sqrt{T})$ minimax optimal regret guarantee. Additionally, through an impossibility result, we highlight that one is unlikely to compete this favorably with a stronger oracle (than the considered Lipschitz bidding policies). Finally, we test our algorithm on three real-world first-price auction datasets obtained from Verizon Media and demonstrate our algorithm's superior performance compared to several existing bidding algorithms.
Submitted 9 July, 2020;
originally announced July 2020.
-
Optimal No-regret Learning in Repeated First-price Auctions
Authors:
Yanjun Han,
Zhengyuan Zhou,
Tsachy Weissman
Abstract:
We study online learning in repeated first-price auctions where a bidder, only observing the winning bid at the end of each auction, learns to adaptively bid in order to maximize her cumulative payoff. To achieve this goal, the bidder faces censored feedback: if she wins the bid, then she is not able to observe the highest bid of the other bidders, which we assume is \textit{iid} drawn from an unknown distribution. In this paper, we develop the first learning algorithm that achieves a near-optimal $\widetilde{O}(\sqrt{T})$ regret bound, by exploiting two structural properties of first-price auctions, i.e. the specific feedback structure and payoff function.
We first formulate the feedback structure in first-price auctions as partially ordered contextual bandits, a combination of the graph feedback across actions (bids), the cross learning across contexts (private values), and a partial order over the contexts. We establish both strengths and weaknesses of this framework, by showing a curious separation that a regret nearly independent of the action/context sizes is possible under stochastic contexts, but is impossible under adversarial contexts. In particular, this framework leads to an $O(\sqrt{T}\log^{2.5}T)$ regret for first-price auctions when the bidder's private values are \emph{iid}.
Despite the limitation of the above framework, we further exploit the special payoff function of first-price auctions to develop a sample-efficient algorithm even in the presence of adversarially generated private values. We establish an $O(\sqrt{T}\log^3 T)$ regret bound for this algorithm, hence providing a complete characterization of optimal learning guarantees for first-price auctions.
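The censored feedback described in this abstract can be illustrated with a minimal sketch of a single auction round. The names `Feedback` and `play_round` are hypothetical, and the tie-breaking rule (the bidder wins ties) is an assumption not stated in the abstract; this is not the authors' algorithm, only the payoff and observation model it learns from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    won: bool
    payoff: float
    winning_bid: float            # the only quantity always observed
    others_max: Optional[float]   # None when censored (the bidder won)


def play_round(value: float, bid: float, others_max: float) -> Feedback:
    """One first-price round: the winner pays her own bid."""
    won = bid >= others_max       # assumption: ties go to the bidder
    if won:
        # The winning bid is the bidder's own bid, so the highest
        # competing bid stays hidden; she only learns others_max <= bid.
        return Feedback(True, value - bid, bid, None)
    # On a loss the winning bid equals the highest competing bid.
    return Feedback(False, 0.0, others_max, others_max)
```

A learner built on this model sees `others_max` only on losing rounds, which is the one-sided information structure the abstract refers to.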
Submitted 4 March, 2024; v1 submitted 21 March, 2020;
originally announced March 2020.
-
LFZip: Lossy compression of multivariate floating-point time series data via improved prediction
Authors:
Shubham Chandak,
Kedar Tatwawadi,
Chengtao Wen,
Lingyun Wang,
Juan Aparicio,
Tsachy Weissman
Abstract:
Time series data compression is emerging as an important problem with the growth in IoT devices and sensors. Due to the presence of noise in these datasets, lossy compression can often provide significant compression gains without impacting the performance of downstream applications. In this work, we propose an error-bounded lossy compressor, LFZip, for multivariate floating-point time series data that provides guaranteed reconstruction up to user-specified maximum absolute error. The compressor is based on the prediction-quantization-entropy coder framework and benefits from improved prediction using linear models and neural networks. We evaluate the compressor on several time series datasets where it outperforms the existing state-of-the-art error-bounded lossy compressors. The code and data are available at https://github.com/shubhamchandak94/LFZip
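To see how a prediction-quantization pipeline can guarantee a maximum absolute error, here is a minimal single-variable sketch. It is not the LFZip implementation: the predictor is simply the previous reconstructed value (an assumption standing in for the linear and neural predictors the abstract mentions), the entropy-coding stage is omitted, and the function names are hypothetical.

```python
def compress(xs, eps):
    """Quantize prediction residuals with step 2*eps."""
    step = 2.0 * eps
    prev = 0.0                  # predictor state: last reconstructed value
    indices = []
    for x in xs:
        q = round((x - prev) / step)   # integer residual index
        indices.append(q)
        prev = prev + q * step         # track what the decoder will see
    return indices                     # these integers would be entropy coded


def decompress(indices, eps):
    """Mirror the encoder's reconstruction exactly."""
    step = 2.0 * eps
    prev = 0.0
    out = []
    for q in indices:
        prev = prev + q * step
        out.append(prev)
    return out
```

Because the encoder predicts from reconstructed values rather than from the originals, quantization errors do not accumulate: each residual is rounded to the nearest multiple of `2*eps`, so every reconstructed sample lies within `eps` of the input, up to floating-point rounding.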
Submitted 13 January, 2020; v1 submitted 1 November, 2019;
originally announced November 2019.
-
Minimum Power to Maintain a Nonequilibrium Distribution of a Markov Chain
Authors:
Dmitri S. Pavlichin,
Yihui Quek,
Tsachy Weissman
Abstract:
Biological systems use energy to maintain non-equilibrium distributions for long times, e.g. of chemical concentrations or protein conformations. What are the fundamental limits of the power used to "hold" a stochastic system in a desired distribution over states? We study the setting of an uncontrolled Markov chain $Q$ altered into a controlled chain $P$ having a desired stationary distribution. Thermodynamics considerations lead to an appropriately defined Kullback-Leibler (KL) divergence rate $D(P||Q)$ as the cost of control, a setting introduced by Todorov, corresponding to a Markov decision process with mean log loss action cost.
The optimal controlled chain $P^*$ minimizes the KL divergence rate $D(\cdot||Q)$ subject to a stationary distribution constraint, and the minimal KL divergence rate lower bounds the power used. While this optimization problem is familiar from the large deviations literature, we offer a novel interpretation as a minimum "holding cost" and compute the minimizer $P^*$ more explicitly than previously available. We state a version of our results for both discrete- and continuous-time Markov chains, and find nice expressions for the important case of a reversible uncontrolled chain $Q$, for a two-state chain, and for birth-and-death processes.
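As a rough illustration of the control cost, the sketch below evaluates the standard KL divergence rate between two discrete-time Markov chains, $D(P||Q) = \sum_i \pi_i \sum_j P_{ij} \log(P_{ij}/Q_{ij})$ with $\pi$ the stationary distribution of $P$, for a two-state chain. That the paper's "appropriately defined" rate coincides with this textbook form is an assumption here, and the helper names are hypothetical. The code evaluates the cost for a given $P$; it does not compute the minimizer $P^*$.

```python
import math


def stationary_two_state(a, b):
    """Stationary distribution of P = [[1-a, a], [b, 1-b]], with a + b > 0."""
    return (b / (a + b), a / (a + b))


def kl_rate(P, Q, pi):
    """KL divergence rate (nats per step) of chain P relative to chain Q.

    Assumes Q[i][j] > 0 wherever P[i][j] > 0.
    """
    total = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            if p > 0.0:
                total += pi[i] * p * math.log(p / Q[i][j])
    return total
```

For example, with `P = [[0.9, 0.1], [0.3, 0.7]]` the stationary distribution is `(0.75, 0.25)`, and `kl_rate(P, Q, pi)` is zero when `Q` equals `P` and positive when the two chains differ.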
Submitted 2 July, 2019;
originally announced July 2019.
-
Optimal Communication Rates and Combinatorial Properties for Common Randomness Generation
Authors:
Yanjun Han,
Kedar Tatwawadi,
Gowtham R. Kurri,
Zhengqing Zhou,
Vinod M. Prabhakaran,
Tsachy Weissman
Abstract:
We study common randomness generation problems where $n$ players aim to generate the same sequences of random coin flips, where some subsets of the players share an independent common coin which can be tossed multiple times, and there is a publicly seen blackboard through which the players communicate with each other. We provide a tight representation of the optimal communication rates via linear programming, and more importantly, propose explicit algorithms for the optimal distributed simulation for a wide class of hypergraphs. In particular, the optimal communication rate in complete hypergraphs is still achievable in sparser hypergraphs containing a path-connected cycle-free cluster of topologically connected components. Some key steps in analyzing the upper bounds rely on two different definitions of connectivity in hypergraphs, which may be of independent interest.
Submitted 6 October, 2021; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Neural Joint Source-Channel Coding
Authors:
Kristy Choi,
Kedar Tatwawadi,
Aditya Grover,
Tsachy Weissman,
Stefano Ermon
Abstract:
For reliable transmission across a noisy communication channel, classical results from information theory show that it is asymptotically optimal to separate out the source and channel coding processes. However, this decomposition can fall short in the finite bit-length regime, as it requires non-trivial tuning of hand-crafted codes and assumes infinite computational power for decoding. In this work, we propose to jointly learn the encoding and decoding processes using a new discrete variational autoencoder model. By adding noise into the latent codes to simulate the channel during training, we learn to both compress and error-correct given a fixed bit-length and computational budget. We obtain codes that are not only competitive against several separation schemes, but also learn useful robust representations of the data for downstream tasks such as classification. Finally, inference amortization yields an extremely fast neural decoder, almost an order of magnitude faster compared to standard decoding methods based on iterative belief propagation.
Submitted 14 May, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Towards improved lossy image compression: Human image reconstruction with public-domain images
Authors:
Ashutosh Bhown,
Soham Mukherjee,
Sean Yang,
Shubham Chandak,
Irena Fischer-Hwang,
Kedar Tatwawadi,
Judith Fan,
Tsachy Weissman
Abstract:
Lossy image compression has been studied extensively in the context of typical loss functions such as RMSE, MS-SSIM, etc. However, compression at low bitrates generally produces unsatisfying results. Furthermore, the availability of massive public image datasets appears to have hardly been exploited in image compression. Here, we present a paradigm for eliciting human image reconstruction in order to perform lossy image compression. In this paradigm, one human describes images to a second human, whose task is to reconstruct the target image using publicly available images and text instructions. The resulting reconstructions are then evaluated by human raters on the Amazon Mechanical Turk platform and compared to reconstructions obtained using state-of-the-art compressor WebP. Our results suggest that prioritizing semantic visual elements may be key to achieving significant improvements in image compression, and that our paradigm can be used to develop a more human-centric loss function.
The images, results and additional data are available at https://compression.stanford.edu/human-compression
Submitted 24 June, 2019; v1 submitted 25 October, 2018;
originally announced October 2018.
-
Concentration Inequalities for the Empirical Distribution
Authors:
Jay Mardia,
Jiantao Jiao,
Ervin Tánczos,
Robert D. Nowak,
Tsachy Weissman
Abstract:
We study concentration inequalities for the Kullback--Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method of types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation or zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distribution, and demonstrate their quantitatively different behaviors between small and large sample sizes compared to the alphabet size.
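For reference, the baseline that this abstract improves on can be sketched as follows. The code computes the KL divergence between the empirical and true distributions and the classical method of types bound $\Pr[D(\hat{P}_n \| P) \ge \epsilon] \le \binom{n+k-1}{k-1} e^{-n\epsilon}$. The function names are hypothetical, and the paper's tighter recursive bound is not reproduced here.

```python
import math
from collections import Counter


def empirical_kl(samples, p):
    """KL(empirical || p) in nats; p maps each symbol to its true probability."""
    n = len(samples)
    counts = Counter(samples)
    # Symbols that never occur contribute 0 to the sum, so they are skipped.
    return sum((c / n) * math.log((c / n) / p[s]) for s, c in counts.items())


def method_of_types_bound(n, k, eps):
    """Classical bound: (number of types) * exp(-n * eps), capped at 1."""
    num_types = math.comb(n + k - 1, k - 1)  # types of n samples on k symbols
    return min(1.0, num_types * math.exp(-n * eps))
```

The polynomial prefactor grows quickly with the alphabet size $k$, which is why the bound becomes loose for large alphabets, the regime where the abstract reports the largest improvement.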
Submitted 18 October, 2019; v1 submitted 18 September, 2018;
originally announced September 2018.
-
Minimax redundancy for Markov chains with large state space
Authors:
Kedar Shriram Tatwawadi,
Jiantao Jiao,
Tsachy Weissman
Abstract:
For any Markov source, there exist universal codes whose normalized codelength approaches the Shannon limit asymptotically as the number of samples goes to infinity. This paper investigates how fast the gap between the normalized codelength of the "best" universal compressor and the Shannon limit (i.e. the compression redundancy) vanishes non-asymptotically in terms of the alphabet size and mixing time of the Markov source. We show that, for Markov sources whose relaxation time is at least $1 + \frac{(2+c)}{\sqrt{k}}$, where $k$ is the state space size (and $c>0$ is a constant), the phase transition for the number of samples required to achieve vanishing compression redundancy is precisely $Θ(k^2)$.
Submitted 5 May, 2018; v1 submitted 1 May, 2018;
originally announced May 2018.
-
Geometric Lower Bounds for Distributed Parameter Estimation under Communication Constraints
Authors:
Yanjun Han,
Ayfer Özgür,
Tsachy Weissman
Abstract:
We consider parameter estimation in distributed networks, where each sensor in the network observes an independent sample from an underlying distribution and has $k$ bits to communicate its sample to a centralized processor which computes an estimate of a desired parameter. We develop lower bounds for the minimax risk of estimating the underlying parameter for a large class of losses and distributions. Our results show that under mild regularity conditions, the communication constraint reduces the effective sample size by a factor of $d$ when $k$ is small, where $d$ is the dimension of the estimated parameter. Furthermore, this penalty reduces at most exponentially with increasing $k$, which is the case for some models, e.g., estimating high-dimensional distributions. For other models, however, we show that the sample size reduction is remediated only linearly with increasing $k$, e.g., when some sub-Gaussian structure is available. We apply our results to the distributed setting with product Bernoulli model, multinomial model, Gaussian location models, and logistic regression, which recover or strengthen existing results.
Our approach significantly deviates from existing approaches for developing information-theoretic lower bounds for communication-efficient estimation. We circumvent the need for strong data processing inequalities used in prior work and develop a geometric approach which builds on a new representation of the communication constraint. This approach allows us to strengthen and generalize existing results with simpler and more transparent proofs.
Submitted 22 July, 2021; v1 submitted 23 February, 2018;
originally announced February 2018.
-
Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman
Abstract:
We present \emph{Local Moment Matching (LMM)}, a unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. We construct an efficiently computable estimator that achieves the minimax rates in estimating the distribution up to permutation, and show that the plug-in approach of our unlabeled distribution estimator is "universal" in estimating symmetric functionals of discrete distributions. Instead of doing best polynomial approximation explicitly as in existing literature of functional estimation, the plug-in approach conducts polynomial approximation implicitly and attains the optimal sample complexity for the entropy, power sum and support size functionals.
Submitted 26 June, 2018; v1 submitted 23 February, 2018;
originally announced February 2018.
-
Entropy Rate Estimation for Markov Chains with Large State Space
Authors:
Yanjun Han,
Jiantao Jiao,
Chuan-Zheng Lee,
Tsachy Weissman,
Yihong Wu,
Tiancheng Yu
Abstract:
Estimating the entropy based on data is one of the prototypical problems in distribution property testing and estimation. For estimating the Shannon entropy of a distribution on $S$ elements with independent samples, [Paninski2004] showed that the sample complexity is sublinear in $S$, and [Valiant--Valiant2011] showed that consistent estimation of Shannon entropy is possible if and only if the sample size $n$ far exceeds $\frac{S}{\log S}$. In this paper we consider the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that:
(1) As long as the Markov chain mixes not too slowly, i.e., the relaxation time is at most $O(\frac{S}{\ln^3 S})$, consistent estimation is achievable when $n \gg \frac{S^2}{\log S}$.
(2) As long as the Markov chain has some slight dependency, i.e., the relaxation time is at least $1+Ω(\frac{\ln^2 S}{\sqrt{S}})$, consistent estimation is impossible when $n \lesssim \frac{S^2}{\log S}$.
Under both assumptions, the optimal estimation accuracy is shown to be $Θ(\frac{S^2}{n \log S})$. In comparison, the empirical entropy rate requires at least $Ω(S^2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.
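The empirical (plug-in) entropy rate that this abstract uses as a point of comparison can be sketched in a few lines. This is the baseline estimator, not the sample-optimal estimator from the paper, and the function name is hypothetical. It estimates $-\sum_{i,j} \hat{\pi}_i \hat{P}_{ij} \log \hat{P}_{ij}$ from transition counts along a single sample path.

```python
import math
from collections import Counter


def empirical_entropy_rate(path):
    """Plug-in entropy rate (nats per symbol) of a first-order Markov path."""
    pair_counts = Counter(zip(path, path[1:]))   # N_ij: observed i -> j moves
    from_counts = Counter(path[:-1])             # N_i: moves leaving state i
    n_trans = len(path) - 1
    h = 0.0
    for (i, _j), c in pair_counts.items():
        # (N_ij / n) weights the pair; (N_ij / N_i) estimates P_ij.
        h -= (c / n_trans) * math.log(c / from_counts[i])
    return h
```

With the entropy rate measured in nats, `math.exp(h)` gives the corresponding perplexity, which is the link to language modeling mentioned in the abstract.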
Submitted 24 September, 2018; v1 submitted 21 February, 2018;
originally announced February 2018.
-
Approximate Profile Maximum Likelihood
Authors:
Dmitri S. Pavlichin,
Jiantao Jiao,
Tsachy Weissman
Abstract:
We propose an efficient algorithm for approximate computation of the profile maximum likelihood (PML), a variant of maximum likelihood maximizing the probability of observing a sufficient statistic rather than the empirical sample. The PML has appealing theoretical properties, but is difficult to compute exactly. Inspired by observations gleaned from exactly solvable cases, we look for an approximate PML solution, which, intuitively, clumps comparably frequent symbols into one symbol. This amounts to lower-bounding a certain matrix permanent by summing over a subgroup of the symmetric group rather than the whole group during the computation. We extensively experiment with the approximate solution, and find the empirical performance of our approach is competitive and sometimes significantly better than state-of-the-art performance for various estimation problems.
Submitted 19 December, 2017;
originally announced December 2017.
-
Optimal rates of entropy estimation over Lipschitz balls
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman,
Yihong Wu
Abstract:
We consider the problem of minimax estimation of the entropy of a density over Lipschitz balls. Dropping the usual assumption that the density is bounded away from zero, we obtain the minimax rates $(n\ln n)^{-s/(s+d)} + n^{-1/2}$ for $0<s\leq 2$ for densities supported on $[0,1]^d$, where $s$ is the smoothness parameter and $n$ is the number of independent samples. We generalize the results to densities with unbounded support: given an Orlicz function $Ψ$ of rapid growth (such as the sub-exponential and sub-Gaussian classes), the minimax rates for densities with bounded $Ψ$-Orlicz norm increase to $(n\ln n)^{-s/(s+d)} (Ψ^{-1}(n))^{d(1-d/p(s+d))} + n^{-1/2}$, where $p$ is the norm parameter in the Lipschitz ball. We also show that the integral-form plug-in estimators with kernel density estimates fail to achieve the minimax rates, and characterize their worst case performances over the Lipschitz ball.
One of the key steps in analyzing the bias relies on a novel application of the Hardy-Littlewood maximal inequality, which also leads to a new inequality on the Fisher information that may be of independent interest.
Submitted 10 November, 2019; v1 submitted 6 November, 2017;
originally announced November 2017.
-
Universality of Logarithmic Loss in Lossy Compression
Authors:
Albert No,
Tsachy Weissman
Abstract:
We establish two strong senses of universality of logarithmic loss as a distortion criterion in lossy compression: For any fixed length lossy compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy compression problem under logarithmic loss. In the successive refinement problem, if the first decoder operates under logarithmic loss, we show that any discrete memoryless source is successively refinable under an arbitrary distortion criterion for the second decoder.
Submitted 31 August, 2017;
originally announced September 2017.
-
Generalizations of Maximal Inequalities to Arbitrary Selection Rules
Authors:
Jiantao Jiao,
Yanjun Han,
Tsachy Weissman
Abstract:
We present a generalization of the maximal inequalities that upper bound the expectation of the maximum of $n$ jointly distributed random variables. We control the expectation of a randomly selected random variable from $n$ jointly distributed random variables, and present bounds that are at least as tight as the classical maximal inequalities, and much tighter when the distribution of the selection index is nearly deterministic. A new family of information-theoretic measures is introduced in the process, which may be of independent interest.
Submitted 29 August, 2017;
originally announced August 2017.
-
Estimating the Fundamental Limits is Easier than Achieving the Fundamental Limits
Authors:
Jiantao Jiao,
Yanjun Han,
Irena Fischer-Hwang,
Tsachy Weissman
Abstract:
We show through case studies that it is easier to estimate the fundamental limits of data processing than to construct explicit algorithms to achieve those limits. Focusing on binary classification, data compression, and prediction under logarithmic loss, we show that in the finite space setting, when it is possible to construct an estimator of the limits with vanishing error with $n$ samples, it may require at least $n\ln n$ samples to construct an explicit algorithm to achieve the limits.
Submitted 1 October, 2017; v1 submitted 4 July, 2017;
originally announced July 2017.
-
Minimax Estimation of the $L_1$ Distance
Authors:
Jiantao Jiao,
Yanjun Han,
Tsachy Weissman
Abstract:
We consider the problem of estimating the $L_1$ distance between two discrete probability measures $P$ and $Q$ from empirical data in a nonasymptotic and large alphabet setting. When $Q$ is known and one obtains $n$ samples from $P$, we show that for every $Q$, the minimax rate-optimal estimator with $n$ samples achieves performance comparable to that of the maximum likelihood estimator (MLE) with $n\ln n$ samples. When both $P$ and $Q$ are unknown, we construct minimax rate-optimal estimators whose worst case performance is essentially that of the known $Q$ case with $Q$ being uniform, implying that $Q$ being uniform is essentially the most difficult case. The \emph{effective sample size enlargement} phenomenon, identified in Jiao \emph{et al.} (2015), holds both in the known $Q$ case for every $Q$ and the $Q$ unknown case. However, the construction of optimal estimators for $\|P-Q\|_1$ requires new techniques and insights beyond the approximation-based method of functional estimation in Jiao \emph{et al.} (2015).
Submitted 23 June, 2018; v1 submitted 2 May, 2017;
originally announced May 2017.
-
Mutual Information, Relative Entropy and Estimation Error in Semi-martingale Channels
Authors:
Jiantao Jiao,
Kartik Venkat,
Tsachy Weissman
Abstract:
Fundamental relations between information and estimation have been established in the literature for the continuous-time Gaussian and Poisson channels, in a long line of work starting from the classical representation theorems by Duncan and Kabanov respectively. In this work, we demonstrate that such relations hold for a much larger family of continuous-time channels. We introduce the family of semi-martingale channels where the channel output is a semi-martingale stochastic process, and the channel input modulates the characteristics of the semi-martingale. For these channels, which include as special cases the continuous-time Gaussian and Poisson models, we establish new representations relating the mutual information between the channel input and output to an optimal causal filtering loss, thereby unifying and considerably extending results from the Gaussian and Poisson settings. Extensions to the setting of mismatched estimation are also presented, where the relative entropy between the laws governing the output of the channel under two different input distributions is equal to the cumulative difference between the estimation loss incurred by using the mismatched and optimal causal filters respectively. The main tool underlying these results is the Doob--Meyer decomposition of a class of likelihood ratio sub-martingales. The results in this work can be viewed as the continuous-time analogues of recent generalizations for relations between information and estimation for discrete-time Lévy channels.
Submitted 18 April, 2017;
originally announced April 2017.
-
Dependence Measures Bounding the Exploration Bias for General Measurements
Authors:
Jiantao Jiao,
Yanjun Han,
Tsachy Weissman
Abstract:
We propose a framework to analyze and quantify the bias in adaptive data analysis. It generalizes that proposed by Russo and Zou'15, applying to measurements whose moment generating function exists, measurements with a finite $p$-norm, and measurements in general Orlicz spaces. We introduce a new class of dependence measures which retain key properties of mutual information while more effectively quantifying the exploration bias for heavy tailed distributions. We provide examples of cases where our bounds are nearly tight in situations where the original framework of Russo and Zou'15 does not apply.
Submitted 17 July, 2017; v1 submitted 17 December, 2016;
originally announced December 2016.
-
Demystifying ResNet
Authors:
Sihan Li,
Jiantao Jiao,
Yanjun Han,
Tsachy Weissman
Abstract:
The Residual Network (ResNet), proposed in He et al. (2015), utilized shortcut connections to significantly reduce the difficulty of training, which resulted in great performance boosts in terms of both training and generalization error.
It was empirically observed in He et al. (2015) that stacking more layers of residual blocks with shortcut 2 results in smaller training error, while it is not true for shortcut of length 1 or 3. We provide a theoretical explanation for the uniqueness of shortcut 2.
We show that with or without nonlinearities, by adding shortcuts that have depth two, the condition number of the Hessian of the loss function at the zero initial point is depth-invariant, which makes training very deep models no more difficult than shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point initially, from which the optimization algorithm is hard to escape. The shortcut 1, however, is essentially equivalent to no shortcuts, which has a condition number exploding to infinity as the number of layers grows. We further argue that as the number of layers tends to infinity, it suffices to only look at the loss function at the zero initial point.
Extensive experiments are provided accompanying our theoretical results. We show that initializing the network to small weights with shortcut 2 achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of deeper depth, from various perspectives ranging from final loss, learning dynamics and stability, to the behavior of the Hessian along the learning process.
Submitted 20 May, 2017; v1 submitted 3 November, 2016;
originally announced November 2016.
-
When is Noisy State Information at the Encoder as Useless as No Information or as Good as Noise-Free State?
Authors:
Rui Xu,
Jun Chen,
Tsachy Weissman,
Jian-Kang Zhang
Abstract:
For any binary-input channel with perfect state information at the decoder, if the mutual information between the noisy state observation at the encoder and the true channel state is below a positive threshold determined solely by the state distribution, then the capacity is the same as that with no encoder side information. A complementary phenomenon is revealed for the generalized probing capacity. Extensions beyond binary-input channels are developed.
Submitted 1 November, 2016;
originally announced November 2016.
-
Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman
Abstract:
We study the minimax estimation of $α$-divergences between discrete distributions for integer $α\ge 1$, which include the Kullback--Leibler divergence and the $χ^2$-divergences as special examples. Dropping the usual theoretical tricks to acquire independence, we construct the first minimax rate-optimal estimator which does not require any Poissonization, sample splitting, or explicit construction of approximating polynomials. The estimator uses a hybrid approach which solves a problem-independent linear program based on moment matching in the non-smooth regime, and applies a problem-dependent bias-corrected plug-in estimator in the smooth regime, with a soft decision boundary between these regimes.
Submitted 3 March, 2021; v1 submitted 30 May, 2016;
originally announced May 2016.
-
DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing
Authors:
Byunghan Lee,
Taesup Moon,
Sungroh Yoon,
Tsachy Weissman
Abstract:
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
Submitted 4 July, 2017; v1 submitted 16 November, 2015;
originally announced November 2015.
-
Strong Successive Refinability and Rate-Distortion-Complexity Tradeoff
Authors:
Albert No,
Amir Ingber,
Tsachy Weissman
Abstract:
We investigate the second order asymptotics (source dispersion) of the successive refinement problem. Similarly to the classical definition of a successively refinable source, we say that a source is strongly successively refinable if successive refinement coding can achieve the second order optimum rate (including the dispersion terms) at both decoders. We establish a sufficient condition for strong successive refinability. We show that any discrete source under Hamming distortion and the Gaussian source under quadratic distortion are strongly successively refinable.
We also demonstrate how successive refinement ideas can be used in point-to-point lossy compression problems in order to reduce complexity. We give two examples, the binary-Hamming and Gaussian-quadratic cases, in which a layered code construction results in a low complexity scheme that attains optimal performance. For example, when the number of layers grows with the block length $n$, we show how to design an $O(n^{\log(n)})$ algorithm that asymptotically achieves the rate-distortion bound.
Submitted 15 March, 2016; v1 submitted 10 June, 2015;
originally announced June 2015.
-
Does Dirichlet Prior Smoothing Solve the Shannon Entropy Estimation Problem?
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman
Abstract:
The Dirichlet prior is widely used in estimating discrete distributions and functionals of discrete distributions. In terms of Shannon entropy estimation, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional expectation. We show that in general they do \emph{not} improve over the maximum likelihood estimator, which plugs the empirical distribution into the entropy functional. No matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve the minimax rates in entropy estimation, as recently characterized by Jiao, Venkat, Han, and Weissman, and Wu and Yang. The performance of the minimax rate-optimal estimator with $n$ samples is essentially \emph{at least} as good as that of the Dirichlet smoothed entropy estimators with $n\ln n$ samples.
We harness the theory of approximation using positive linear operators for analyzing the bias of plug-in estimators for general functionals under arbitrary statistical models, thereby further consolidating the interplay between these two fields, which was thoroughly developed and exploited by Jiao, Venkat, Han, and Weissman. We establish new results in approximation theory, and apply them to analyze the bias of the Dirichlet prior smoothed plug-in entropy estimator. This interplay between bias analysis and approximation theory is of relevance and consequence far beyond the specific problem setting in this paper.
Submitted 18 September, 2017; v1 submitted 1 February, 2015;
originally announced February 2015.
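To make the first approach in the abstract above concrete, here is a minimal Python sketch of a Dirichlet-smoothed plug-in entropy estimate. It assumes a symmetric Dirichlet prior with a single parameter `a` and a known alphabet size, so each symbol's smoothed probability is (count + a) / (n + S·a). The function name and its arguments are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def dirichlet_smoothed_entropy(samples, alphabet_size, a):
    """Plug-in entropy (nats) of the Dirichlet-smoothed distribution.

    Assumption: symmetric Dirichlet prior with parameter a >= 0, so the
    smoothed probability of symbol i is (n_i + a) / (n + S * a).
    """
    counts = Counter(samples)
    n = len(samples)
    if a < 0:
        raise ValueError("a must be non-negative")
    if len(counts) > alphabet_size:
        raise ValueError("more distinct symbols than alphabet_size")
    denom = n + alphabet_size * a
    if denom <= 0:
        raise ValueError("need at least one sample or a > 0")
    # Probabilities of the symbols that were observed at least once.
    probs = [(c + a) / denom for c in counts.values()]
    # Symbols never observed still receive mass a / denom each.
    unseen = alphabet_size - len(counts)
    probs.extend([a / denom] * unseen)
    return -sum(p * math.log(p) for p in probs if p > 0)
```

With `a = 0` the sketch reduces to the maximum likelihood (empirical) entropy, which is the baseline the abstract compares against.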
-
Adaptive Estimation of Shannon Entropy
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman
Abstract:
We consider estimating the Shannon entropy of a discrete distribution $P$ from $n$ i.i.d. samples. Recently, Jiao, Venkat, Han, and Weissman, and Wu and Yang constructed approximation theoretic estimators that achieve the minimax $L_2$ rates in estimating entropy. Their estimators are consistent given $n \gg \frac{S}{\ln S}$ samples, where $S$ is the alphabet size, and it is the best possible sample complexity. In contrast, the Maximum Likelihood Estimator (MLE), which is the empirical entropy, requires $n\gg S$ samples.
In the present paper we significantly refine the minimax results of existing work. To alleviate the pessimism of minimaxity, we adopt the adaptive estimation framework, and show that the minimax rate-optimal estimator in Jiao, Venkat, Han, and Weissman achieves the minimax rates simultaneously over a nested sequence of subsets of distributions $P$, without knowing the alphabet size $S$ or which subset $P$ lies in. In other words, their estimator is adaptive with respect to this nested sequence of the parameter space, which is characterized by the entropy of the distribution. We also characterize the maximum risk of the MLE over this nested sequence, and show, for every subset in the sequence, that the performance of the minimax rate-optimal estimator with $n$ samples is essentially that of the MLE with $n\ln n$ samples, thereby further substantiating the generality of the phenomenon identified by Jiao, Venkat, Han, and Weissman.
Submitted 1 January, 2019; v1 submitted 1 February, 2015;
originally announced February 2015.
-
Minimax Estimation of Discrete Distributions under $\ell_1$ Loss
Authors:
Yanjun Han,
Jiantao Jiao,
Tsachy Weissman
Abstract:
We analyze the problem of discrete distribution estimation under $\ell_1$ loss. We provide non-asymptotic upper and lower bounds on the maximum risk of the empirical distribution (the maximum likelihood estimator), and the minimax risk in regimes where the alphabet size $S$ may grow with the number of observations $n$. We show that among distributions with bounded entropy $H$, the asymptotic maximum risk for the empirical distribution is $2H/\ln n$, while the asymptotic minimax risk is $H/\ln n$. Moreover, we show that a hard-thresholding estimator, oblivious to the unknown upper bound $H$, is asymptotically minimax. However, if we constrain the estimates to lie in the simplex of probability distributions, then the asymptotic minimax risk is again $2H/\ln n$. We draw connections between our work and the literature on density estimation, entropy estimation, total variation distance ($\ell_1$ divergence) estimation, joint distribution estimation in stochastic processes, normal mean estimation, and adaptive estimation.
Submitted 28 December, 2015; v1 submitted 5 November, 2014;
originally announced November 2014.
-
Beyond Maximum Likelihood: from Theory to Practice
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
Maximum likelihood is the most widely used statistical estimation technique. Recent work by the authors introduced a general methodology for the construction of estimators for functionals in parametric models, and demonstrated improvements - both in theory and in practice - over the maximum likelihood estimator (MLE), particularly in high dimensional scenarios involving parameter dimension comparable to or larger than the number of samples. This approach to estimation, building on results from approximation theory, is shown to yield minimax rate-optimal estimators for a wide class of functionals, implementable with modest computational requirements. In a nutshell, a message of this recent work is that, for a wide class of functionals, the performance of these essentially optimal estimators with $n$ samples is comparable to that of the MLE with $n \ln n$ samples.
In the present paper, we highlight the applicability of the aforementioned methodology to statistical problems beyond functional estimation, and show that it can yield substantial gains. For example, we demonstrate that for learning tree-structured graphical models, our approach achieves a significant reduction of the required data size compared with the classical Chow--Liu algorithm, which is an implementation of the MLE, to achieve the same accuracy. The key step in improving the Chow--Liu algorithm is to replace the empirical mutual information with the estimator for mutual information proposed by the authors. Further, applying the same replacement approach to classical Bayesian network classification, the resulting classifiers uniformly outperform the previous classifiers on 26 widely used datasets.
Submitted 25 September, 2014;
originally announced September 2014.
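The Chow--Liu step described in the abstract above can be sketched as follows: estimate the mutual information of every pair of variables, then keep a maximum-weight spanning tree. This is a minimal Python sketch of the classical algorithm with the empirical (MLE) mutual information, not the authors' implementation. The `mi_estimator` argument is a hypothetical hook marking where a different mutual information estimator could be substituted, which is the replacement the abstract describes.

```python
import math
from collections import Counter
from itertools import combinations

def empirical_mutual_information(samples, i, j):
    """Plug-in estimate of I(X_i; X_j) in nats from a list of tuples."""
    n = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    marg_i = Counter(s[i] for s in samples)
    marg_j = Counter(s[j] for s in samples)
    # (c/n) * log( (c/n) / ((n_a/n) * (n_b/n)) ) simplifies as below.
    return sum((c / n) * math.log(c * n / (marg_i[a] * marg_j[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(samples, mi_estimator=empirical_mutual_information):
    """Return the edges of a maximum-weight spanning tree (Kruskal)."""
    d = len(samples[0])
    weighted = sorted(((mi_estimator(samples, i, j), i, j)
                       for i, j in combinations(range(d), 2)),
                      reverse=True)
    parent = list(range(d))  # union-find forest over the variables

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

For example, `chow_liu_tree([(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)])` links variables 0 and 1 first, since they are identical in this toy data set while variable 2 is empirically independent of both.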
-
Maximum Likelihood Estimation of Functionals of Discrete Distributions
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
We consider the problem of estimating functionals of discrete distributions, and focus on tight nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their \emph{bias}.
We characterize the worst case squared error risk incurred by the Maximum Likelihood Estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i = 1}^S -p_i \ln p_i$, and $F_α(P) = \sum_{i = 1}^S p_i^α,α>0$, up to multiplicative constants, for any alphabet size $S\leq \infty$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that it is necessary and sufficient to have $n \gg S$ observations for the MLE to be consistent. In addition, we establish that it is necessary and sufficient to consider $n \gg S^{1/α}$ samples for the MLE to consistently estimate $F_α(P), 0<α<1$. The minimax rate-optimal estimators for both problems require $S/\ln S$ and $S^{1/α}/\ln S$ samples, which implies that the MLE has a strictly sub-optimal sample complexity. When $1<α<3/2$, we show that the worst-case squared error rate of convergence for the MLE is $n^{-2(α-1)}$ for infinite alphabet size, while the minimax squared error rate is $(n\ln n)^{-2(α-1)}$. When $α\geq 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size.
As an application of the general theory, we analyze the Dirichlet prior smoothing techniques for Shannon entropy estimation. We show that no matter how we tune the parameters in the Dirichlet prior, this technique cannot achieve the minimax rates in entropy estimation.
Submitted 9 August, 2017; v1 submitted 26 June, 2014;
originally announced June 2014.
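As a rough illustration of the plug-in (MLE) estimators analyzed in the abstract above, the following Python sketch evaluates the two stated functionals, $H(P)$ and $F_α(P)$, at the empirical distribution of a sample. The function names are hypothetical and this is not the authors' code; it only instantiates the formulas given in the abstract.

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Return the empirical frequencies of the observed symbols."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    return [c / n for c in Counter(samples).values()]

def mle_entropy(samples):
    """Plug-in estimate of H(P) = sum_i -p_i ln p_i, in nats."""
    return -sum(p * math.log(p) for p in empirical_distribution(samples))

def mle_power_sum(samples, alpha):
    """Plug-in estimate of F_alpha(P) = sum_i p_i^alpha for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return sum(p ** alpha for p in empirical_distribution(samples))
```

Symbols that never appear in the sample contribute nothing to either sum, which is one informal way to see why the plug-in entropy tends to underestimate when $n$ is not much larger than $S$.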
-
Minimax Estimation of Functionals of Discrete Distributions
Authors:
Jiantao Jiao,
Kartik Venkat,
Yanjun Han,
Tsachy Weissman
Abstract:
We propose a general methodology for the construction and analysis of minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the alphabet size $S$ is unknown and may be comparable with the number of observations $n$. We treat the respective regions where the functional is "nonsmooth" and "smooth" separately. In the "nonsmooth" regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the "smooth" regime, we apply a bias-corrected Maximum Likelihood Estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing two important cases: the entropy $H(P) = \sum_{i = 1}^S -p_i \ln p_i$ and $F_α(P) = \sum_{i = 1}^S p_i^α,α>0$. We obtain the minimax $L_2$ rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity $n \asymp S/\ln S$ for entropy estimation. We also show that the sample complexity for estimating $F_α(P),0<α<1$ is $n\asymp S^{1/α}/ \ln S$, which can be achieved by our estimator but not the MLE. For $1<α<3/2$, we show the minimax $L_2$ rate for estimating $F_α(P)$ is $(n\ln n)^{-2(α-1)}$ regardless of the alphabet size, while the $L_2$ rate for the MLE is $n^{-2(α-1)}$. For all the above cases, the behavior of the minimax rate-optimal estimators with $n$ samples is essentially that of the MLE with $n\ln n$ samples. We highlight the practical advantages of our schemes for entropy and mutual information estimation. We demonstrate that our approach reduces running time and boosts the accuracy compared to various existing approaches. Moreover, we show that the mutual information estimator induced by our methodology leads to significant performance boosts over the Chow--Liu algorithm in learning graphical models.
Submitted 10 March, 2015; v1 submitted 26 June, 2014;
originally announced June 2014.
-
Rateless Lossy Compression via the Extremes
Authors:
Albert No,
Tsachy Weissman
Abstract:
We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near-optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties we respectively refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than $\frac{n^2}{\log^β n}$ per source symbol for $β>0$ at both the encoder and decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes (SPARC) of Venkataramanan et al.
Submitted 8 March, 2016; v1 submitted 25 June, 2014;
originally announced June 2014.
-
Distortion-Rate Function of Sub-Nyquist Sampled Gaussian Sources
Authors:
Alon Kipnis,
Andrea J. Goldsmith,
Yonina C. Eldar,
Tsachy Weissman
Abstract:
The amount of information lost in sub-Nyquist sampling of a continuous-time Gaussian stationary process is quantified. We consider a combined source coding and sub-Nyquist reconstruction problem in which the input to the encoder is a noisy sub-Nyquist sampled version of the analog source. We first derive an expression for the mean squared error in the reconstruction of the process from a noisy and information rate-limited version of its samples. This expression is a function of the sampling frequency and the average number of bits describing each sample. It is given as the sum of two terms: Minimum mean square error in estimating the source from its noisy but otherwise fully observed sub-Nyquist samples, and a second term obtained by reverse waterfilling over an average of spectral densities associated with the polyphase components of the source. We extend this result to multi-branch uniform sampling, where the samples are available through a set of parallel channels with a uniform sampler and a pre-sampling filter in each branch. Further optimization to reduce distortion is then performed over the pre-sampling filters, and an optimal set of pre-sampling filters associated with the statistics of the input signal and the sampling frequency is found. This results in an expression for the minimal possible distortion achievable under any analog to digital conversion scheme involving uniform sampling and linear filtering. These results thus unify the Shannon-Whittaker-Kotelnikov sampling theorem and Shannon rate-distortion theory for Gaussian sources.
Submitted 6 November, 2015; v1 submitted 21 May, 2014;
originally announced May 2014.
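The abstract above mentions a term obtained by reverse waterfilling. As a point of reference, here is a minimal Python sketch of classical reverse waterfilling for a finite set of independent Gaussian components under squared error. It does not reproduce the paper's expression, which waterfills over an average of spectral densities; the function names and the bisection search are illustrative assumptions.

```python
import math

def reverse_waterfill(variances, theta):
    """Return (rate_in_bits, distortion) at water level theta > 0.

    Components with variance above theta are described at distortion
    theta; the others get zero rate and distortion equal to their variance.
    """
    rate = sum(0.5 * math.log2(v / theta) for v in variances if v > theta)
    distortion = sum(min(v, theta) for v in variances)
    return rate, distortion

def water_level_for_rate(variances, target_rate, iters=100):
    """Bisection for the water level whose total rate meets target_rate."""
    if target_rate < 0 or max(variances) <= 0:
        raise ValueError("need target_rate >= 0 and a positive variance")
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate, _ = reverse_waterfill(variances, mid)
        if rate > target_rate:
            lo = mid  # spending too many bits: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    return hi
```

For variances 4 and 1 with a budget of one bit, the water level sits at 1, so only the stronger component is described and the total distortion is 2.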
-
Relations between Information and Estimation in Discrete-Time Lévy Channels
Authors:
Jiantao Jiao,
Kartik Venkat,
Tsachy Weissman
Abstract:
Fundamental relations between information and estimation have been established in the literature for the discrete-time Gaussian and Poisson channels. In this work, we demonstrate that such relations hold for a much larger class of observation models. We introduce the natural family of discrete-time Lévy channels where the distribution of the output conditioned on the input is infinitely divisible. For Lévy channels, we establish new representations relating the mutual information between the channel input and output to an optimal expected estimation loss, thereby unifying and considerably extending results from the Gaussian and Poisson settings. We demonstrate the richness of our results by working out two examples of Lévy channels, namely the gamma channel and the negative binomial channel, with corresponding relations between information and estimation. Extensions to the setting of mismatched estimation are also presented.
△ Less
Submitted 1 February, 2017; v1 submitted 27 April, 2014;
originally announced April 2014.
-
Information Measures: the Curious Case of the Binary Alphabet
Authors:
Jiantao Jiao,
Thomas Courtade,
Albert No,
Kartik Venkat,
Tsachy Weissman
Abstract:
Four problems related to information divergence measures defined on finite alphabets are considered. In three of the cases we consider, we illustrate a contrast which arises between the binary-alphabet and larger-alphabet settings. This is surprising in some instances, since characterizations for the larger-alphabet settings do not generalize their binary-alphabet counterparts. Specifically, we show that $f$-divergences are not the unique decomposable divergences on binary alphabets that satisfy the data processing inequality, thereby clarifying claims that have previously appeared in the literature. We also show that KL divergence is the unique Bregman divergence which is also an $f$-divergence for any alphabet size. We show that KL divergence is the unique Bregman divergence which is invariant to statistically sufficient transformations of the data, even when non-decomposable divergences are considered. Like some of the problems we consider, this result holds only when the alphabet size is at least three.
△ Less
Submitted 28 November, 2014; v1 submitted 27 April, 2014;
originally announced April 2014.
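The abstract above refers to KL divergence being both a Bregman divergence and an $f$-divergence. The Python sketch below checks this numerically on one pair of distributions, using the standard choices $f(t) = t \ln t$ and the negative entropy as the Bregman generator. It illustrates the two representations only and says nothing about the uniqueness results; all function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum_i p_i ln(p_i / q_i), assuming q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_i q_i f(p_i / q_i), assuming every q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def bregman_divergence(p, q, phi, grad_phi):
    """B_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    grad = grad_phi(q)
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad, p, q))
    return phi(p) - phi(q) - inner

def t_log_t(t):
    """Generator f(t) = t ln t, extended by continuity to f(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def neg_entropy(p):
    """Bregman generator phi(p) = sum_i p_i ln p_i."""
    return sum(x * math.log(x) for x in p if x > 0)

def grad_neg_entropy(p):
    """Gradient of phi: ln p_i + 1, assuming every p_i > 0."""
    return [math.log(x) + 1.0 for x in p]
```

For `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.5, 0.3]`, all three functions agree up to floating-point error, because the extra terms in the Bregman form cancel when both vectors sum to one.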