Search | arXiv e-print repository

Information-Theoretic Bayes Risk Lower Bounds for Realizable Models

Abstract: We derive information-theoretic lower bounds on the Bayes risk and generalization error of realizable machine learning models. In particular, we employ an analysis in which the rate-distortion function of the model parameters bounds the required mutual information between the training samples and the model parameters in order to learn a model up to a Bayes risk constraint. For realizable models, we show that both the rate-distortion function and mutual information admit expressions that are convenient for analysis. For models that are (roughly) lower Lipschitz in their parameters, we bound the rate-distortion function from below, whereas for VC classes, the mutual information is bounded above by $d_\mathrm{vc}\log(n)$. When these conditions match, the Bayes risk with respect to the zero-one loss scales no faster than $\Omega(d_\mathrm{vc}/n)$, which matches known outer bounds and minimax lower bounds up to logarithmic factors. We also consider the impact of label noise, providing lower bounds when training and/or test samples are corrupted.

Submitted 8 November, 2021; originally announced November 2021.

arXiv:2006.05752 [pdf, ps, other]

Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization

Authors: Nuwan Ferdinand, Haider Al-Lawati, Stark C. Draper, Matthew Nokleby

Abstract: Distributed optimization is vital in solving large-scale machine learning problems. A widely-shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch before the system can proceed to the next epoch. In such settings, slow nodes, called stragglers, can greatly slow progress. To mitigate the impact of stragglers, we propose an online distributed optimization method called Anytime Minibatch. In this approach, all nodes are given a fixed time to compute the gradients of as many data samples as possible. The result is a variable per-node minibatch size. Workers then get a fixed communication time to average their minibatch gradients via several rounds of consensus, which are then used to update primal variables via dual averaging. Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete. We present a convergence analysis and analyze the wall time performance. Our numerical results show that our approach is up to 1.5 times faster in Amazon EC2 and it is up to five times faster when there is greater variability in compute node performance.

Submitted 10 June, 2020; originally announced June 2020.

Comments: International Conference on Learning Representations (ICLR), May 2019, New Orleans, LA, USA

Journal ref: Proc. of the 7th Int. Conf. on Learning Representations (ICLR), May 2019, New Orleans, LA, USA
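
The fixed-compute-time idea in the Anytime Minibatch abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the function `anytime_minibatch_average`, the per-node processing `rates`, and the use of an exact sample-weighted average in place of the consensus rounds and dual-averaging update the abstract describes.

```python
import numpy as np

def anytime_minibatch_average(grad_fn, node_samples, rates, compute_time):
    """Average gradients when every node gets the same fixed compute time.

    grad_fn      -- hypothetical per-sample gradient function returning an ndarray
    node_samples -- one list of samples per node
    rates        -- assumed samples-per-second each node can process
    compute_time -- the fixed time budget shared by all nodes
    """
    grad_sum = None
    total = 0
    batch_sizes = []
    for samples, rate in zip(node_samples, rates):
        # A slow node simply finishes fewer samples; its work is still used.
        b = min(len(samples), int(rate * compute_time))
        batch_sizes.append(b)
        for x in samples[:b]:
            g = grad_fn(x)
            grad_sum = g if grad_sum is None else grad_sum + g
        total += b
    if total == 0:
        raise ValueError("no node finished any sample in the time budget")
    # Exact averaging stands in for the paper's consensus rounds (assumption).
    return grad_sum / total, batch_sizes
```

Calling it with two nodes whose rates differ by a factor of two yields per-node minibatch sizes that differ accordingly, while the returned gradient is weighted by the number of samples each node actually processed.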

arXiv:2005.08854 [pdf, other]

doi 10.1109/JPROC.2020.3021381

Scaling-up Distributed Processing of Data Streams for Machine Learning

Authors: Matthew Nokleby, Haroon Raja, Waheed U. Bajwa

Abstract: Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communications links, this poses a challenging question: how can one best leverage the incoming data for distributed training under constraints on computing capabilities and/or communications rate? A large body of research has emerged in recent decades to tackle this and related problems. This paper reviews recently developed methods that focus on large-scale distributed stochastic optimization in the compute- and bandwidth-limited regime, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication and streaming rates.
In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence. For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data. Further, it reviews guarantees underlying these methods, which show there exist regimes in which systems can learn from distributed, streaming data at order-optimal rates.

Submitted 31 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 45 pages, 9 figures; preprint of a journal paper published in Proceedings of the IEEE (Special Issue on Optimization for Data-driven Learning and Control)

Journal ref: Proc. of the IEEE, vol. 108, no. 11, pp. 1984-2012, Nov. 2020

arXiv:2004.07268 [pdf, other]

Learning Furniture Compatibility with Graph Neural Networks

Authors: Luisa F. Polania, Mauricio Flores, Yiran Li, Matthew Nokleby

Abstract: We propose a graph neural network (GNN) approach to the problem of predicting the stylistic compatibility of a set of furniture items from images. While most existing results are based on siamese networks which evaluate pairwise compatibility between items, the proposed GNN architecture exploits relational information among groups of items. We present two GNN models, both of which comprise a deep CNN that extracts a feature representation for each image, a gated recurrent unit (GRU) network that models interactions between the furniture items in a set, and an aggregation function that calculates the compatibility score. In the first model, a generalized contrastive loss function that promotes the generation of clustered embeddings for items belonging to the same furniture set is introduced. Also, in the first model, the edge function between nodes in the GRU and the aggregation function are fixed in order to limit model complexity and allow training on smaller datasets; in the second model, the edge function and aggregation function are learned directly from the data. We demonstrate state-of-the-art accuracy for compatibility prediction and "fill in the blank" tasks on the Bonn and Singapore furniture datasets. We further introduce a new dataset, called the Target Furniture Collections dataset, which contains over 6000 furniture items that have been hand-curated by stylists to make up 1632 compatible sets. We also demonstrate superior prediction accuracy on this dataset.

Submitted 15 April, 2020; originally announced April 2020.

Comments: Accepted for publication at CVPR Workshops

arXiv:1903.07507 [pdf, other]

An Effective Label Noise Model for DNN Text Classification

Authors: Ishan Jindal, Daniel Pressel, Brian Lester, Matthew Nokleby

Abstract: Because large, human-annotated datasets suffer from labeling errors, it is crucial to be able to train deep neural networks in the presence of label noise. While training image classification models with label noise has received much attention, training text classification models has not. In this paper, we propose an approach to training deep networks that is robust to label noise. This approach introduces a non-linear processing layer (noise model) that models the statistics of the label noise into a convolutional neural network (CNN) architecture. The noise model and the CNN weights are learned jointly from noisy training data, which prevents the model from overfitting to erroneous labels. Through extensive experiments on several text classification datasets, we show that this approach enables the CNN to learn better sentence representations and is robust even to extreme label noise. We find that proper initialization and regularization of this noise model is critical. Further, by contrast to results focusing on large batch sizes for mitigating label noise for image classification, we find that altering the batch size does not have much effect on classification performance.

Submitted 18 March, 2019; originally announced March 2019.

Comments: Accepted at NAACL-HLT 2019 Main Conference Long paper

arXiv:1811.04345 [pdf, other]

Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining

Authors: Ishan Jindal, Zhiwei Qin, Xuewen Chen, Matthew Nokleby, Jieping Ye

Abstract: In this paper, we develop a reinforcement learning (RL) based system to learn an effective policy for carpooling that maximizes transportation efficiency so that fewer cars are required to fulfill the given amount of trip demand. For this purpose, first, we develop a deep neural network model, called ST-NN (Spatio-Temporal Neural Network), to predict taxi trip time from the raw GPS trip data. Secondly, we develop a carpooling simulation environment for RL training, with the output of ST-NN and using the NYC taxi trip dataset. In order to maximize transportation efficiency and minimize traffic congestion, we choose the effective distance covered by the driver on a carpool trip as the reward. Therefore, the more effective distance a driver achieves over a trip (i.e. to satisfy more trip demand) the higher the efficiency and the less will be the traffic congestion. We compared the performance of RL learned policy to a fixed policy (which always accepts carpool) as a baseline and obtained promising results that are interpretable and demonstrate the advantage of our RL approach. We also compare the performance of ST-NN to that of state-of-the-art travel time estimation methods and observe that ST-NN significantly improves the prediction performance and is more robust to outliers.

Submitted 10 November, 2018; originally announced November 2018.

Comments: Accepted at IEEE International Conference on Big Data 2018. arXiv admin note: text overlap with arXiv:1710.04350

arXiv:1810.11499 [pdf, other]

Information Bottleneck Methods for Distributed Learning

Authors: Parinaz Farajiparvar, Ahmad Beirami, Matthew Nokleby

Abstract: We study a distributed learning problem in which Alice sends a compressed distillation of a set of training data to Bob, who uses the distilled version to best solve an associated learning problem. We formalize this as a rate-distortion problem in which the training set is the source and Bob's cross-entropy loss is the distortion measure. We consider this problem for unsupervised learning for batch and sequential data. In the batch data, this problem is equivalent to the information bottleneck (IB), and we show that reduced-complexity versions of standard IB methods solve the associated rate-distortion problem. For the streaming data, we present a new algorithm, which may be of independent interest, that solves the rate-distortion problem for Gaussian sources. Furthermore, to improve the results of the iterative algorithm for sequential data we introduce a two-pass version of this algorithm. Finally, we show the dependency of the rate on the number of samples $k$ required for Gaussian sources to ensure cross-entropy loss that scales optimally with the growth of the training set.

Submitted 26 October, 2018; originally announced October 2018.

arXiv:1810.10957 [pdf, other]

Tensor Matched Kronecker-Structured Subspace Detection for Missing Information

Authors: Ishan Jindal, Matthew Nokleby

Abstract: We consider the problem of detecting whether a tensor signal having many missing entities lies within a given low dimensional Kronecker-Structured (KS) subspace. This is a matched subspace detection problem. The tensor matched subspace detection problem is more challenging because of the intertwined signal dimensions. We solve this problem by projecting the signal onto the Kronecker structured subspace, which is a Kronecker product of different subspaces corresponding to each signal dimension. Under this framework, we define the KS subspaces and the orthogonal projection of the signal onto the KS subspace. We prove that reliable detection is possible as long as the cardinality of the missing signal is greater than the dimensions of the KS subspace by bounding the residual energy of the sampling signal with high probability.

Submitted 25 October, 2018; originally announced October 2018.

arXiv:1802.08378 [pdf, other]

Multi-Scale Spectrum Sensing in Dense Multi-Cell Cognitive Networks

Authors: Nicolo Michelusi, Matthew Nokleby, Urbashi Mitra, Robert Calderbank

Abstract: Multi-scale spectrum sensing is proposed to overcome the cost of full network state information on the spectrum occupancy of primary users (PUs) in dense multi-cell cognitive networks. Secondary users (SUs) estimate the local spectrum occupancies and aggregate them hierarchically to estimate spectrum occupancy at multiple spatial scales. Thus, SUs obtain fine-grained estimates of spectrum occupancies of nearby cells, more relevant to scheduling tasks, and coarse-grained estimates of those of distant cells. An agglomerative clustering algorithm is proposed to design a cost-effective aggregation tree, matched to the structure of interference, robust to local estimation errors and delays. Given these multi-scale estimates, the SU traffic is adapted in a decentralized fashion in each cell, to optimize the trade-off among SU cell throughput, interference caused to PUs, and mutual SU interference. Numerical evaluations demonstrate a small degradation in SU cell throughput (up to 15% for a 0dB interference-to-noise ratio experienced at PUs) compared to a scheme with full network state information, using only one-third of the cost incurred in the exchange of spectrum estimates. The proposed interference-matched design is shown to significantly outperform a random tree design, by providing more relevant information for network control, and a state-of-the-art consensus-based algorithm, which does not leverage the spatio-temporal structure of interference across the network.

Submitted 6 December, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

Comments: To appear in IEEE Transactions on Communications

arXiv:1710.04350 [pdf, other]

A Unified Neural Network Approach for Estimating Travel Time and Distance for a Taxi Trip

Authors: Ishan Jindal, Tony, Qin, Xuewen Chen, Matthew Nokleby, Jieping Ye

Abstract: In building intelligent transportation systems such as taxi or rideshare services, accurate prediction of travel time and distance is crucial for customer experience and resource management. Using the NYC taxi dataset, which contains taxi trips data collected from GPS-enabled taxis [23], this paper investigates the use of deep neural networks to jointly predict taxi trip time and distance. We propose a model, called ST-NN (Spatio-Temporal Neural Network), which first predicts the travel distance between an origin and a destination GPS coordinate, then combines this prediction with the time of day to predict the travel time. The beauty of ST-NN is that it uses only the raw trips data without requiring further feature engineering and provides a joint estimate of travel time and distance. We compare the performance of ST-NN to that of state-of-the-art travel time estimation methods, and we observe that the proposed approach generalizes better than state-of-the-art methods. We show that the ST-NN approach significantly reduces the mean absolute error for both predicted travel time and distance, about 17% for travel time prediction. We also observe that the proposed approach is more robust to outliers present in the dataset by testing the performance of ST-NN on the datasets with and without outliers.

Submitted 11 October, 2017; originally announced October 2017.

arXiv:1705.03419 [pdf, other]

Learning Deep Networks from Noisy Labels with Dropout Regularization

Authors: Ishan Jindal, Matthew Nokleby, Xuewen Chen

Abstract: Large datasets often have unreliable labels, such as those obtained from Amazon's Mechanical Turk or social media platforms, and classifiers trained on mislabeled datasets often exhibit poor performance. We present a simple, effective technique for accounting for label noise when training deep neural networks. We augment a standard deep network with a softmax layer that models the label noise statistics. Then, we train the deep network and noise model jointly via end-to-end stochastic gradient descent on the (perhaps mislabeled) dataset. The augmented model is overdetermined, so in order to encourage the learning of a non-trivial noise model, we apply dropout regularization to the weights of the noise model during training. Numerical experiments on noisy versions of the CIFAR-10 and MNIST datasets show that the proposed dropout technique outperforms state-of-the-art methods.

Submitted 9 May, 2017; originally announced May 2017.

Comments: Published at 2016 IEEE 16th International Conference on Data Mining
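
The noise-model idea in the abstract above (a softmax layer stacked on the base network to model label noise statistics) can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' architecture: the names `noisy_label_probs` and `noise_weights` are hypothetical, the noise layer is assumed to be a row-wise softmax over a class-by-class weight matrix, and the dropout regularization and joint training described in the abstract are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def noisy_label_probs(clean_logits, noise_weights):
    """Map base-network logits to a distribution over observed (noisy) labels.

    clean_logits  -- (batch, classes) outputs of the base network
    noise_weights -- (classes, classes) hypothetical noise-layer parameters;
                     row i parameterizes P(observed label | true label = i)
    """
    clean = softmax(clean_logits)        # P(true label | input)
    transition = softmax(noise_weights)  # each row sums to 1
    return clean @ transition            # P(observed label | input)
```

With `noise_weights` set to a large multiple of the identity, the layer is close to a no-op and the output matches the base network's predictions; with all-zero weights it models uniformly random labels.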

arXiv:1705.02556 [pdf, other]

doi 10.1109/JSTSP.2018.2838549

Classification and Representation via Separable Subspaces: Performance Limits and Algorithms

Authors: Ishan Jindal, Matthew Nokleby

Abstract: We study the classification performance of Kronecker-structured models in two asymptotic regimes and develop an algorithm for separable, fast and compact K-S dictionary learning for better classification and representation of multidimensional signals by exploiting the structure in the signal. First, we study the classification performance in terms of diversity order and pairwise geometry of the subspaces. We derive an exact expression for the diversity order as a function of the signal and subspace dimensions of a K-S model. Next, we study the classification capacity, the maximum rate at which the number of classes can grow as the signal dimension goes to infinity. Then we describe a fast algorithm for Kronecker-Structured Learning of Discriminative Dictionaries (K-SLD2). Finally, we evaluate the empirical classification performance of K-S models for the synthetic data, showing that they agree with the diversity order analysis. We also evaluate the performance of K-SLD2 on synthetic and real-world datasets showing that the K-SLD2 balances compact signal representation and good classification performance.

Submitted 29 December, 2017; v1 submitted 6 May, 2017; originally announced May 2017.

Comments: This paper is submitted to IEEE JSTSP Special Issue on Information-Theoretic Methods in Data Acquisition, Analysis, and Processing 2018

Journal ref: IEEE Journal of Selected Topics in Signal Processing (Volume: 12, Issue: 5, Oct. 2018)

arXiv:1704.07888 [pdf, other]

doi 10.1109/TSIPN.2018.2866320

Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks

Authors: Matthew Nokleby, Waheed U. Bajwa

Abstract: Motivated by machine learning applications in networks of sensors, internet-of-things (IoT) devices, and autonomous agents, we propose techniques for distributed stochastic convex learning from high-rate data streams. The setup involves a network of nodes, each one of which has a stream of data arriving at a constant rate, that solve a stochastic convex optimization problem by collaborating with each other over rate-limited communication links. To this end, we present and analyze two algorithms, termed distributed stochastic approximation mirror descent (D-SAMD) and accelerated distributed stochastic approximation mirror descent (AD-SAMD), that are based on two stochastic variants of mirror descent and in which nodes collaborate via approximate averaging of the local, noisy subgradients using distributed consensus. Our main contributions are (i) bounds on the convergence rates of D-SAMD and AD-SAMD in terms of the number of nodes, network topology, and ratio of the data streaming and communication rates, and (ii) sufficient conditions for order-optimum convergence of these algorithms. In particular, we show that for sufficiently well-connected networks, distributed learning schemes can obtain order-optimum convergence even if the communications rate is small. Further we find that the use of accelerated methods significantly enlarges the regime in which order-optimum convergence is achieved; this is in contrast to the centralized setting, where accelerated methods usually offer only a modest improvement.
Finally, we demonstrate the effectiveness of the proposed algorithms using numerical experiments.

Submitted 6 August, 2018; v1 submitted 25 April, 2017; originally announced April 2017.

Comments: 16 pages, 6 figures; Accepted for publication in IEEE Transactions on Signal and Information Processing over Networks

Journal ref: Published in IEEE Trans. Signal Inform. Proc. over Netw., vol. 5, no. 1, pp. 152-167, Mar. 2019

arXiv:1702.07973 [pdf, other]

Multi-scale Spectrum Sensing in Small-Cell mm-Wave Cognitive Wireless Networks

Authors: Nicolo Michelusi, Matthew Nokleby, Urbashi Mitra, Robert Calderbank

Abstract: In this paper, a multi-scale approach to spectrum sensing in cognitive cellular networks is proposed. In order to overcome the huge cost incurred in the acquisition of full network state information, a hierarchical scheme is proposed, based on which local state estimates are aggregated up the hierarchy to obtain aggregate state information at multiple scales, which are then sent back to each cell for local decision making. Thus, each cell obtains fine-grained estimates of the channel occupancies of nearby cells, but coarse-grained estimates of those of distant cells. The performance of the aggregation scheme is studied in terms of the trade-off between the throughput achievable by secondary users and the interference generated by the activity of these secondary users to primary users. In order to account for the irregular structure of interference patterns arising from path loss, shadowing, and blockages, which are especially relevant in millimeter wave networks, a greedy algorithm is proposed to find a multi-scale aggregation tree to optimize the performance. It is shown numerically that this tailored hierarchy outperforms a regular tree construction by 60%.

Submitted 25 February, 2017; originally announced February 2017.

Comments: To appear at ICC 2017

arXiv:1608.01267 [pdf, ps, other]

Low-Dimensional Shaping for High-Dimensional Lattice Codes

Authors: Nuwan S. Ferdinand, Brian M. Kurkoski, Matthew Nokleby, Behnaam Aazhang

Abstract: We propose two low-complexity lattice code constructions that have competitive coding and shaping gains. The first construction, named systematic Voronoi shaping, maps short blocks of integers to the dithered Voronoi integers, which are dithered integers that are uniformly distributed over the Voronoi region of a low-dimensional shaping lattice. Then, these dithered Voronoi integers are encoded using a high-dimensional lattice retaining the same shaping and coding gains of low and high-dimensional lattices. A drawback to this construction is that there is no isomorphism between the underlying message and the lattice code, preventing its use in applications such as compute-and-forward. Therefore we propose a second construction, called mixed nested lattice codes, in which a high-dimensional coding lattice is nested inside a concatenation of low-dimensional shaping lattices. This construction not only retains the same shaping/coding gains as the first construction but also provides the desired algebraic structure. We numerically study these methods, for point-to-point channels as well as compute-and-forward using low-density lattice codes (LDLCs) as coding lattices and E8 and Barnes-Wall as shaping lattices. Numerical results indicate a shaping gain of up to 0.86 dB, compared to the state-of-the-art of 0.4 dB; furthermore, the proposed method has lower complexity than state-of-the-art approaches.

Submitted 3 August, 2016; originally announced August 2016.

Comments: 13 pages

arXiv:1605.02268 [pdf, other]

Rate-Distortion Bounds on Bayes Risk in Supervised Learning

Authors: Matthew Nokleby, Ahmad Beirami, Robert Calderbank

Abstract: We present an information-theoretic framework for bounding the number of labeled samples needed to train a classifier in a parametric Bayesian setting. We derive bounds on the average $L_p$ distance between the learned classifier and the true maximum a posteriori classifier, which are well-established surrogates for the excess classification error due to imperfect learning. We provide lower and upper bounds on the rate-distortion function, using $L_p$ loss as the distortion measure, of a maximum a priori classifier in terms of the differential entropy of the posterior distribution and a quantity called the interpolation dimension, which characterizes the complexity of the parametric distribution family. In addition to expressing the information content of a classifier in terms of lossy compression, the rate-distortion function also expresses the minimum number of bits a learning machine needs to extract from training data to learn a classifier to within a specified $L_p$ tolerance. We use results from universal source coding to express the information content in the training data in terms of the Fisher information of the parametric family and the number of training samples available. The result is a framework for computing lower bounds on the Bayes $L_p$ risk. This framework complements the well-known probably approximately correct (PAC) framework, which provides minimax risk bounds involving the Vapnik-Chervonenkis dimension or Rademacher complexity.
Whereas the PAC framework upper bounds the risk for the worst-case data distribution, the proposed rate-distortion framework lower bounds the risk averaged over the data distribution. We evaluate the bounds for a variety of data models, including categorical, multinomial, and Gaussian models. In each case the bounds are provably tight orderwise, and in two cases we prove that the bounds are tight up to multiplicative constants.

Submitted 17 November, 2017; v1 submitted 7 May, 2016; originally announced May 2016.

Comments: Revised submission to IEEE Transactions on Information Theory

arXiv:1404.5187 [pdf, other]

Discrimination on the Grassmann Manifold: Fundamental Limits of Subspace Classifiers

Authors: Matthew Nokleby, Miguel Rodrigues, Robert Calderbank

Abstract: We present fundamental limits on the reliable classification of linear and affine subspaces from noisy, linear features. Drawing an analogy between discrimination among subspaces and communication over vector wireless channels, we propose two Shannon-inspired measures to characterize asymptotic classifier performance. First, we define the classification capacity, which characterizes necessary and sufficient conditions for the misclassification probability to vanish as the signal dimension, the number of features, and the number of subspaces to be discerned all approach infinity. Second, we define the diversity-discrimination tradeoff which, by analogy with the diversity-multiplexing tradeoff of fading vector channels, characterizes relationships between the number of discernible subspaces and the misclassification probability as the noise power approaches zero. We derive upper and lower bounds on these measures which are tight in many regimes. Numerical results, including a face recognition application, validate the results in practice.

Submitted 10 December, 2014; v1 submitted 21 April, 2014; originally announced April 2014.

Comments: 19 pages, 4 figures. Revised submission to IEEE Transactions on Information Theory

arXiv:1208.3251 [pdf, other]

doi 10.1109/JSTSP.2013.2246765

Toward Resource-Optimal Consensus over the Wireless Medium

Authors: Matthew Nokleby, Waheed U. Bajwa, Robert Calderbank, Behnaam Aazhang

Abstract: We carry out a comprehensive study of the resource cost of averaging consensus in wireless networks. Most previous approaches suppose a graphical network, which abstracts away crucial features of the wireless medium, and measure resource consumption only in terms of the total number of transmissions required to achieve consensus. Under a path-loss dominated model, we study the resource requirements of consensus with respect to three wireless-appropriate metrics: total transmit energy, elapsed time, and time-bandwidth product. First we characterize the performance of several popular gossip algorithms, showing that they may be order-optimal with respect to transmit energy but are strictly suboptimal with respect to elapsed time and time-bandwidth product. Further, we propose a new consensus scheme, termed hierarchical averaging, and show that it is nearly order-optimal with respect to all three metrics. Finally, we examine the effects of quantization, showing that hierarchical averaging provides a nearly order-optimal tradeoff between resource consumption and quantization error.

Submitted 11 February, 2013; v1 submitted 15 August, 2012; originally announced August 2012.

Comments: 12 pages, 3 figures, to appear in IEEE Journal of Selected Topics in Signal Processing, April 2013

Journal ref: IEEE J. Select. Topics Signal Processing, vol. 7, no. 2, pp. 284-295, Apr. 2013

arXiv:1203.0695 [pdf, ps, other]

Cooperative Compute-and-Forward

Authors: Matthew Nokleby, Behnaam Aazhang

Abstract: We examine the benefits of user cooperation under compute-and-forward. Much like in network coding, receivers in a compute-and-forward network recover finite-field linear combinations of transmitters' messages. Recovery is enabled by linear codes: transmitters map messages to a linear codebook, and receivers attempt to decode the incoming superposition of signals to an integer combination of codewords. However, the achievable computation rates are low if channel gains do not correspond to a suitable linear combination. In response to this challenge, we propose a cooperative approach to compute-and-forward. We devise a lattice-coding approach to block Markov encoding with which we construct a decode-and-forward style computation strategy. Transmitters broadcast lattice codewords, decode each other's messages, and then cooperatively transmit resolution information to aid receivers in decoding the integer combinations. Using our strategy, we show that cooperation offers a significant improvement both in the achievable computation rate and in the diversity-multiplexing tradeoff.

Submitted 3 March, 2012; originally announced March 2012.

Comments: submitted to IEEE Transactions on Information Theory

Showing 1–19 of 19 results for author: Nokleby, M