Search | arXiv e-print repository

Distributed Matrix-Based Sampling for Graph Neural Network Training

Authors: Alok Tripathy, Katherine Yelick, Aydin Buluc

Abstract: Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling appr… ▽ More Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and use communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster than Quiver (a distributed extension to PyTorch-Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show a $8.46\times$ speedup on $128$ GPUs in per-epoch time. Finally, we show scaling when the graph is distributed across GPUs and scaling for both node-wise and layer-wise sampling algorithms. △ Less

Submitted 19 April, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: Proceedings of Machine Learning and Systems

arXiv:2310.10294 [pdf, other]

Key-phrase boosted unsupervised summary generation for FinTech organization

Authors: Aadit Deshpande, Shreya Goyal, Prateek Nagwanshi, Avinash Tripathy

Abstract: With the recent advances in social media, the use of NLP techniques in social media data analysis has become an emerging research direction. Business organizations can particularly benefit from such an analysis of social media discourse, providing an external perspective on consumer behavior. Some of the NLP applications such as intent detection, sentiment classification, text summarization can he… ▽ More With the recent advances in social media, the use of NLP techniques in social media data analysis has become an emerging research direction. Business organizations can particularly benefit from such an analysis of social media discourse, providing an external perspective on consumer behavior. Some of the NLP applications such as intent detection, sentiment classification, text summarization can help FinTech organizations to utilize the social media language data to find useful external insights and can be further utilized for downstream NLP tasks. Particularly, a summary which highlights the intents and sentiments of the users can be very useful for these organizations to get an external perspective. This external perspective can help organizations to better manage their products, offers, promotional campaigns, etc. However, certain challenges, such as a lack of labeled domain-specific datasets impede further exploration of these tasks in the FinTech domain. To overcome these challenges, we design an unsupervised phrase-based summary generation from social media data, using 'Action-Object' pairs (intent phrases). We evaluated the proposed method with other key-phrase based summary generation methods in the direction of contextual information of various Reddit discussion threads, available in the different summaries. We introduce certain "Context Metrics" such as the number of Unique words, Action-Object pairs, and Noun chunks to evaluate the contextual information retrieved from the source text in these phrase-based summaries. We demonstrate that our methods significantly outperform the baseline on these metrics, thus providing a qualitative and quantitative measure of their efficacy. Proposed framework has been leveraged as a web utility portal hosted within Amex. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 8 pages, 4 figures

arXiv:2309.01662

Towards Persistent Memory based Stateful Serverless Computing for Big Data Applications

Authors: Yuze Li, Kevin Assogba, Abhijit Tripathy, Moiz Arif, M. Mustafa Rafique, Ali R. Butt, Dimitrios Nikolopoulos

Abstract: The Function-as-a-service (FaaS) computing model has recently seen significant growth especially for highly scalable, event-driven applications. The easy-to-deploy and cost-efficient fine-grained billing of FaaS is highly attractive to big data applications. However, the stateless nature of serverless platforms poses major challenges when supporting stateful I/O intensive workloads such as a lack… ▽ More The Function-as-a-service (FaaS) computing model has recently seen significant growth especially for highly scalable, event-driven applications. The easy-to-deploy and cost-efficient fine-grained billing of FaaS is highly attractive to big data applications. However, the stateless nature of serverless platforms poses major challenges when supporting stateful I/O intensive workloads such as a lack of native support for stateful execution, state sharing, and inter-function communication. In this paper, we explore the feasibility of performing stateful big data analytics on serverless platforms and improving I/O throughput of functions by using modern storage technologies such as Intel Optane DC Persistent Memory (PMEM). To this end, we propose Marvel, an end-to-end architecture built on top of the popular serverless platform, Apache OpenWhisk and Apache Hadoop. Marvel makes two main contributions: (1) enable stateful function execution on OpenWhisk by maintaining state information in an in-memory caching layer; and (2) provide access to PMEM backed HDFS storage for faster I/O performance. Our evaluation shows that Marvel reduces the overall execution time of big data applications by up to 86.6% compared to current MapReduce implementations on AWS Lambda. △ Less

Submitted 8 September, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: Not yet ready to be publicly available

arXiv:2304.07679 [pdf]

doi 10.1145/3580252.3586972

Using Geographic Location-based Public Health Features in Survival Analysis

Authors: Navid Seidi, Ardhendu Tripathy, Sajal K. Das

Abstract: Time elapsed till an event of interest is often modeled using the survival analysis methodology, which estimates a survival score based on the input features. There is a resurgence of interest in develo** more accurate prediction models for time-to-event prediction in personalized healthcare using modern tools such as neural networks. Higher quality features and more frequent observations improv… ▽ More Time elapsed till an event of interest is often modeled using the survival analysis methodology, which estimates a survival score based on the input features. There is a resurgence of interest in develo** more accurate prediction models for time-to-event prediction in personalized healthcare using modern tools such as neural networks. Higher quality features and more frequent observations improve the predictions for a patient, however, the impact of including a patient's geographic location-based public health statistics on individual predictions has not been studied. This paper proposes a complementary improvement to survival analysis models by incorporating public health statistics in the input features. We show that including geographic location-based public health information results in a statistically significant improvement in the concordance index evaluated on the Surveillance, Epidemiology, and End Results (SEER) dataset containing nationwide cancer incidence data. The improvement holds for both the standard Cox proportional hazards model and the state-of-the-art Deep Survival Machines model. Our results indicate the utility of geographic location-based public health features in survival analysis. △ Less

Submitted 15 April, 2023; originally announced April 2023.

Journal ref: 2023 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), 2023, 80-91

arXiv:2304.00019 [pdf, other]

doi 10.5281/zenodo.7750670

Workflows Community Summit 2022: A Roadmap Revolution

Authors: Rafael Ferreira da Silva, Rosa M. Badia, Venkat Bala, Debbie Bard, Peer-Timo Bremer, Ian Buckley, Silvina Caino-Lores, Kyle Chard, Carole Goble, Shantenu Jha, Daniel S. Katz, Daniel Laney, Manish Parashar, Frederic Suter, Nick Tyler, Thomas Uram, Ilkay Altintas, Stefan Andersson, William Arndt, Juan Aznar, Jonathan Bader, Bartosz Balis, Chris Blanton, Kelly Rosa Braghetto, Aharon Brodutch , et al. (80 additional authors not shown)

Abstract: Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and t… ▽ More Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, need for processing large scale datasets produced by instruments at the edge, intensification of near real-time data processing, support for long-term experiment campaigns, and emergence of quantum computing as an adjunct to HPC, have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge-to-cloud-to-HPC enable the management of many small-sized files, allow data reduction while ensuring high accuracy, orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among others. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, as well as enabling the publication of workflows and their associated products according to the FAIR principles. This document reports on discussions and findings from the 2022 international edition of the Workflows Community Summit that took place on November 29 and 30, 2022. △ Less

Submitted 31 March, 2023; originally announced April 2023.

Report number: ORNL/TM-2023/2885

arXiv:2212.02346 [pdf, other]

Accu-Help: A Machine Learning based Smart Healthcare Framework for Accurate Detection of Obsessive Compulsive Disorder

Authors: Kabita Patel, Ajaya Kumar Tripathy, Laxmi Narayan Padhy, Sujita Kumar Kar, Susanta Kumar Padhy, Saraju Prasad Mohanty

Abstract: In recent years the importance of Smart Healthcare cannot be overstated. The current work proposed to expand the state-of-art of smart healthcare in integrating solutions for Obsessive Compulsive Disorder (OCD). Identification of OCD from oxidative stress biomarkers (OSBs) using machine learning is an important development in the study of OCD. However, this process involves the collection of OCD c… ▽ More In recent years the importance of Smart Healthcare cannot be overstated. The current work proposed to expand the state-of-art of smart healthcare in integrating solutions for Obsessive Compulsive Disorder (OCD). Identification of OCD from oxidative stress biomarkers (OSBs) using machine learning is an important development in the study of OCD. However, this process involves the collection of OCD class labels from hospitals, collection of corresponding OSBs from biochemical laboratories, integrated and labeled dataset creation, use of suitable machine learning algorithm for designing OCD prediction model, and making these prediction models available for different biochemical laboratories for OCD prediction for unlabeled OSBs. Further, from time to time, with significant growth in the volume of the dataset with labeled samples, redesigning the prediction model is required for further use. The whole process requires distributed data collection, data integration, coordination between the hospital and biochemical laboratory, dynamic machine learning OCD prediction mode design using a suitable machine learning algorithm, and making the machine learning model available for the biochemical laboratories. Kee** all these things in mind, Accu-Help a fully automated, smart, and accurate OCD detection conceptual model is proposed to help the biochemical laboratories for efficient detection of OCD from OSBs. OSBs are classified into three classes: Healthy Individual (HI), OCD Affected Individual (OAI), and Genetically Affected Individual (GAI). The main component of this proposed framework is the machine learning OCD prediction model design. In this Accu-Help, a neural network-based approach is presented with an OCD prediction accuracy of 86 percent. △ Less

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2105.03280 [pdf, other]

Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms

Authors: Tosin P. Adewumi, Roshanak Vadoodi, Aparajita Tripathy, Konstantina Nikolaidou, Foteini Liwicki, Marcus Liwicki

Abstract: We present a fairly large, Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges with NLP systems with regards to tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes such as it is in this work. To the best of the authors' knowledge,… ▽ More We present a fairly large, Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges with NLP systems with regards to tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes such as it is in this work. To the best of the authors' knowledge, this is the first idioms corpus with classes of idioms beyond the literal and the general idioms classification. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. We obtain an overall inter-annotator agreement (IAA) score, between two independent annotators, of 88.89%. Many past efforts have been limited in the corpus size and classes of samples but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). The corpus may also be extended by researchers to meet specific needs. The corpus has part of speech (PoS) tagging from the NLTK library. Classification experiments performed on the corpus to obtain a baseline and comparison among three common models, including the BERT model, give good results. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks. △ Less

Submitted 23 April, 2022; v1 submitted 25 April, 2021; originally announced May 2021.

Comments: Accepted at the International Conference on Language Resources and Evaluation (LREC) 2022

arXiv:2104.00792 [pdf, other]

Scalable Hash Table for NUMA Systems

Authors: Alok Tripathy, Oded Green

Abstract: Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based systems. We present in this work a multi-GPU hash table implementation that can process keys at a throughput comparable to that of distributed hash tables. Distrib… ▽ More Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based systems. We present in this work a multi-GPU hash table implementation that can process keys at a throughput comparable to that of distributed hash tables. Distributed CPU hash tables have received significantly more attention than GPU-based hash tables. We show that a single node with multiple GPUs offers roughly the same performance as a 500-1,000-core CPU-based cluster. Our algorithm's key component is our use of multiple sparse-graph data structures and binning techniques to build the hash table. As has been shown individually, these components can be written with massive parallelism that is amenable to GPU acceleration. Since we focus on an individual node, we also leverage communication primitives that are typically prohibitive in distributed environments. We show that our new multi-GPU algorithm shares many of the same features of the single GPU algorithm -- thus we have efficient collision management capabilities and can deal with a large number of duplicates. We evaluate our algorithm on two multi-GPU compute nodes: 1) an NVIDIA DGX2 server with 16 GPUs and 2) an IBM Power 9 Processor with 6 NVIDIA GPUs. With 32-bit keys, our implementation processes 8B keys per second, comparable to some 500-1,000-core CPU-based clusters and 4X faster than prior single-GPU implementations. △ Less

Submitted 1 April, 2021; originally announced April 2021.

arXiv:2103.05057 [pdf, ps, other]

Nearest Neighbor Search Under Uncertainty

Authors: Blake Mason, Ardhendu Tripathy, Robert Nowak

Abstract: Nearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning. There is vast literature on efficient algorithms for constructing data structures and performing exact and approximate NNS. This paper studies NNS under Uncertainty (NNSU). Specifically, consider the setting in which an NNS algorithm has access only to a stochastic distance oracle that provides a… ▽ More Nearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning. There is vast literature on efficient algorithms for constructing data structures and performing exact and approximate NNS. This paper studies NNS under Uncertainty (NNSU). Specifically, consider the setting in which an NNS algorithm has access only to a stochastic distance oracle that provides a noisy, unbiased estimate of the distance between any pair of points, rather than the exact distance. This models many situations of practical importance, including NNS based on human similarity judgements, physical measurements, or fast, randomized approximations to exact distances. A naive approach to NNSU could employ any standard NNS algorithm and repeatedly query and average results from the stochastic oracle (to reduce noise) whenever it needs a pairwise distance. The problem is that a sufficient number of repeated queries is unknown in advance; e.g., a point maybe distant from all but one other point (crude distance estimates suffice) or it may be close to a large number of other points (accurate estimates are necessary). This paper shows how ideas from cover trees and multi-armed bandits can be leveraged to develop an NNSU algorithm that has optimal dependence on the dataset size and the (unknown)geometry of the dataset. △ Less

Submitted 8 March, 2021; originally announced March 2021.

Comments: 22 pages

arXiv:2012.08073 [pdf, other]

Chernoff Sampling for Active Testing and Extension to Active Regression

Authors: Subhojyoti Mukherjee, Ardhendu Tripathy, Robert Nowak

Abstract: Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model. In this paper, we revisit the work of Chernoff that described an asymptotically optimal algorithm for performing a hypothesis test. We obtain a novel sample complexity bound for Chernoff's algorithm, with a non-asymptotic term that characterizes its performance at a fixed… ▽ More Active learning can reduce the number of samples needed to perform a hypothesis test and to estimate the parameters of a model. In this paper, we revisit the work of Chernoff that described an asymptotically optimal algorithm for performing a hypothesis test. We obtain a novel sample complexity bound for Chernoff's algorithm, with a non-asymptotic term that characterizes its performance at a fixed confidence level. We also develop an extension of Chernoff sampling that can be used to estimate the parameters of a wide variety of models and we obtain a non-asymptotic bound on the estimation error. We apply our extension of Chernoff sampling to actively learn neural network models and to estimate parameters in real-data linear and non-linear regression problems, where our approach performs favorably to state-of-the-art methods. △ Less

Submitted 10 March, 2022; v1 submitted 14 December, 2020; originally announced December 2020.

Comments: 47 pages, 9 figures

arXiv:2006.08850 [pdf, other]

Finding All ε-Good Arms in Stochastic Bandits

Authors: Blake Mason, Lalit Jain, Ardhendu Tripathy, Robert Nowak

Abstract: The pure-exploration problem in stochastic multi-armed bandits aims to find one or more arms with the largest (or near largest) means. Examples include finding an ε-good arm, best-arm identification, top-k arm identification, and finding all arms with means above a specified threshold. However, the problem of finding all ε-good arms has been overlooked in past work, although arguably this may be t… ▽ More The pure-exploration problem in stochastic multi-armed bandits aims to find one or more arms with the largest (or near largest) means. Examples include finding an ε-good arm, best-arm identification, top-k arm identification, and finding all arms with means above a specified threshold. However, the problem of finding all ε-good arms has been overlooked in past work, although arguably this may be the most natural objective in many applications. For example, a virologist may conduct preliminary laboratory experiments on a large candidate set of treatments and move all ε-good treatments into more expensive clinical trials. Since the ultimate clinical efficacy is uncertain, it is important to identify all ε-good candidates. Mathematically, the all-ε-good arm identification problem presents significant new challenges and surprises that do not arise in the pure-exploration objectives studied in the past. We introduce two algorithms to overcome these and demonstrate their great empirical performance on a large-scale crowd-sourced dataset of 2.2M ratings collected by the New Yorker Caption Contest as well as a dataset testing hundreds of possible cancer drugs. △ Less

Submitted 11 September, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 93 total pages (8 main pages + appendices), 12 figures, submitted to NeurIPS 2020

arXiv:2005.03300 [pdf, other]

Reducing Communication in Graph Neural Network Training

Authors: Alok Tripathy, Katherine Yelick, Aydin Buluc

Abstract: Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks.… ▽ More Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges. △ Less

Submitted 2 September, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20)

arXiv:2002.01044 [pdf, other]

Optimal Confidence Regions for the Multinomial Parameter

Authors: Matthew L. Malloy, Ardhendu Tripathy, Robert D. Nowak

Abstract: Construction of tight confidence regions and intervals is central to statistical inference and decision making. This paper develops new theory showing minimum average volume confidence regions for categorical data. More precisely, consider an empirical distribution $\widehat{\boldsymbol{p}}$ generated from $n$ iid realizations of a random variable that takes one of $k$ possible values according to… ▽ More Construction of tight confidence regions and intervals is central to statistical inference and decision making. This paper develops new theory showing minimum average volume confidence regions for categorical data. More precisely, consider an empirical distribution $\widehat{\boldsymbol{p}}$ generated from $n$ iid realizations of a random variable that takes one of $k$ possible values according to an unknown distribution $\boldsymbol{p}$. This is analogous to a single draw from a multinomial distribution. A confidence region is a subset of the probability simplex that depends on $\widehat{\boldsymbol{p}}$ and contains the unknown $\boldsymbol{p}$ with a specified confidence. This paper shows how one can construct minimum average volume confidence regions, answering a long standing question. We also show the optimality of the regions directly translates to optimal confidence intervals of linear functionals such as the mean, implying sample complexity and regret improvements for adaptive machine learning algorithms. △ Less

Submitted 29 January, 2021; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:1906.00547 [pdf, other]

MaxGap Bandit: Adaptive Algorithms for Approximate Ranking

Authors: Sumeet Katariya, Ardhendu Tripathy, Robert Nowak

Abstract: This paper studies the problem of adaptively sampling from K distributions (arms) in order to identify the largest gap between any two adjacent means. We call this the MaxGap-bandit problem. This problem arises naturally in approximate ranking, noisy sorting, outlier detection, and top-arm identification in bandits. The key novelty of the MaxGap-bandit problem is that it aims to adaptively determi… ▽ More This paper studies the problem of adaptively sampling from K distributions (arms) in order to identify the largest gap between any two adjacent means. We call this the MaxGap-bandit problem. This problem arises naturally in approximate ranking, noisy sorting, outlier detection, and top-arm identification in bandits. The key novelty of the MaxGap-bandit problem is that it aims to adaptively determine the natural partitioning of the distributions into a subset with larger means and a subset with smaller means, where the split is determined by the largest gap rather than a pre-specified rank or threshold. Estimating an arm's gap requires sampling its neighboring arms in addition to itself, and this dependence results in a novel hardness parameter that characterizes the sample complexity of the problem. We propose elimination and UCB-style algorithms and show that they are minimax optimal. Our experiments show that the UCB-style algorithms require 6-8x fewer samples than non-adaptive sampling to achieve the same error. △ Less

Submitted 2 June, 2019; originally announced June 2019.

arXiv:1905.13267 [pdf, other]

Learning Nearest Neighbor Graphs from Noisy Distance Samples

Authors: Blake Mason, Ardhendu Tripathy, Robert Nowak

Abstract: We consider the problem of learning the nearest neighbor graph of a dataset of n items. The metric is unknown, but we can query an oracle to obtain a noisy estimate of the distance between any pair of items. This framework applies to problem domains where one wants to learn people's preferences from responses commonly modeled as noisy distance judgments. In this paper, we propose an active algorit… ▽ More We consider the problem of learning the nearest neighbor graph of a dataset of n items. The metric is unknown, but we can query an oracle to obtain a noisy estimate of the distance between any pair of items. This framework applies to problem domains where one wants to learn people's preferences from responses commonly modeled as noisy distance judgments. In this paper, we propose an active algorithm to find the graph with high probability and analyze its query complexity. In contrast to existing work that forces Euclidean structure, our method is valid for general metrics, assuming only symmetry and the triangle inequality. Furthermore, we demonstrate efficiency of our method empirically and theoretically, needing only O(n log(n)Delta^-2) queries in favorable settings, where Delta^-2 accounts for the effect of noise. Using crowd-sourced data collected for a subset of the UT Zappos50K dataset, we apply our algorithm to learn which shoes people believe are most similar and show that it beats both an active baseline and ordinal embedding. △ Less

Submitted 30 May, 2019; originally announced May 2019.

Comments: 21 total pages (8 main pages + appendices), 7 figures, submitted to NeurIPS 2019

arXiv:1805.03730 [pdf, other]

Zero-error Function Computation on a Directed Acyclic Network

Authors: Ardhendu Tripathy, Aditya Ramamoorthy

Abstract: We study the rate region of variable-length source-network codes that are used to compute a function of messages observed over a network. The particular network considered here is the simplest instance of a directed acyclic graph (DAG) that is not a tree. Existing work on zero-error function computation in DAG networks provides bounds on the \textit{computation capacity}, which is a measure of the… ▽ More We study the rate region of variable-length source-network codes that are used to compute a function of messages observed over a network. The particular network considered here is the simplest instance of a directed acyclic graph (DAG) that is not a tree. Existing work on zero-error function computation in DAG networks provides bounds on the \textit{computation capacity}, which is a measure of the amount of communication required per edge in the worst case. This work focuses on the average case: an achievable rate tuple describes the expected amount of communication required on each edge, where the expectation is over the probability mass function of the source messages. We describe a systematic procedure to obtain outer bounds to the rate region for computing an arbitrary demand function at the terminal. Our bounding technique works by lower bounding the entropy of the descriptions observed by the terminal conditioned on the function value and by utilizing the Schur-concave property of the entropy function. We evaluate these bounds for certain example demand functions. △ Less

Submitted 9 May, 2018; originally announced May 2018.

Comments: 18 pages, 2 figures, submitted to IEEE Transactions on Information Theory

arXiv:1712.07008 [pdf, other]

Privacy-Preserving Adversarial Networks

Authors: Ardhendu Tripathy, Ye Wang, Prakash Ishwar

Abstract: We propose a data-driven framework for optimizing privacy-preserving data release mechanisms to attain the information-theoretically optimal tradeoff between minimizing distortion of useful data and concealing specific sensitive information. Our approach employs adversarially-trained neural networks to implement randomized mechanisms and to perform a variational approximation of mutual information… ▽ More We propose a data-driven framework for optimizing privacy-preserving data release mechanisms to attain the information-theoretically optimal tradeoff between minimizing distortion of useful data and concealing specific sensitive information. Our approach employs adversarially-trained neural networks to implement randomized mechanisms and to perform a variational approximation of mutual information privacy. We validate our Privacy-Preserving Adversarial Networks (PPAN) framework via proof-of-concept experiments on discrete and continuous synthetic data, as well as the MNIST handwritten digits dataset. For synthetic data, our model-agnostic PPAN approach achieves tradeoff points very close to the optimal tradeoffs that are analytically-derived from model knowledge. In experiments with the MNIST data, we visually demonstrate a learned tradeoff between minimizing the pixel-level distortion versus concealing the written digit. △ Less

Submitted 12 June, 2019; v1 submitted 19 December, 2017; originally announced December 2017.

Comments: 16 pages

MSC Class: 94A15; 68T05; 62B10

arXiv:1612.07773 [pdf, other]

Sum-networks from undirected graphs: construction and capacity analysis

Authors: Ardhendu Tripathy, Aditya Ramamoorthy

Abstract: We consider a directed acyclic network with multiple sources and multiple terminals where each terminal is interested in decoding the sum of independent sources generated at the source nodes. We describe a procedure whereby a simple undirected graph can be used to construct such a sum-network and demonstrate an upper bound on its computation rate. Furthermore, we show sufficient conditions for the… ▽ More We consider a directed acyclic network with multiple sources and multiple terminals where each terminal is interested in decoding the sum of independent sources generated at the source nodes. We describe a procedure whereby a simple undirected graph can be used to construct such a sum-network and demonstrate an upper bound on its computation rate. Furthermore, we show sufficient conditions for the construction of a linear network code that achieves this upper bound. Our procedure allows us to construct sum-networks that have any arbitrary computation rate $\frac{p}{q}$ (where $p,q$ are non-negative integers). Our work significantly generalizes a previous approach for constructing sum-networks with arbitrary capacities. Specifically, we answer an open question in prior work by demonstrating sum-networks with significantly fewer number of sources and terminals. △ Less

Submitted 22 December, 2016; originally announced December 2016.

Comments: This has an extra Appendix section giving more information about Remark 2 of the original paper published in the 52nd Annual Allerton Conference on Communication, Control and Computing, 2014

arXiv:1611.01887 [pdf, other]

Sum-networks from incidence structures: construction and capacity analysis

Authors: Ardhendu Tripathy, Aditya Ramamoorthy

Abstract: A sum-network is an instance of a network coding problem over a directed acyclic network in which each terminal node wants to compute the sum over a finite field of the information observed at all the source nodes. Many characteristics of the well-studied multiple unicast network communication problem also hold for sum-networks due to a known reduction between instances of these two problems. In t… ▽ More A sum-network is an instance of a network coding problem over a directed acyclic network in which each terminal node wants to compute the sum over a finite field of the information observed at all the source nodes. Many characteristics of the well-studied multiple unicast network communication problem also hold for sum-networks due to a known reduction between instances of these two problems. In this work, we describe an algorithm to construct families of sum-network instances using incidence structures. The computation capacity of several of these sum-network families is characterized. We demonstrate that unlike the multiple unicast problem, the computation capacity of sum-networks depends on the characteristic of the finite field over which the sum is computed. This dependence is very strong; we show examples of sum-networks that have a rate-1 solution over one characteristic but a rate close to zero over a different characteristic. Additionally, a sum-network can have an arbitrary different number of computation capacities for different alphabets. This is contrast to the multiple unicast problem where it is known that the capacity is independent of the network coding alphabet. △ Less

Submitted 28 January, 2018; v1 submitted 6 November, 2016; originally announced November 2016.

Comments: Version accepted for publication in IEEE Transactions on Information Theory

arXiv:1601.07228 [pdf, ps, other]

On Computation Rates for Arithmetic Sum

Authors: Ardhendu Tripathy, Aditya Ramamoorthy

Abstract: For zero-error function computation over directed acyclic networks, existing upper and lower bounds on the computation capacity are known to be loose. In this work we consider the problem of computing the arithmetic sum over a specific directed acyclic network that is not a tree. We assume the sources to be i.i.d. Bernoulli with parameter $1/2$. Even in this simple setting, we demonstrate that upp… ▽ More For zero-error function computation over directed acyclic networks, existing upper and lower bounds on the computation capacity are known to be loose. In this work we consider the problem of computing the arithmetic sum over a specific directed acyclic network that is not a tree. We assume the sources to be i.i.d. Bernoulli with parameter $1/2$. Even in this simple setting, we demonstrate that upper bounding the computation rate is quite nontrivial. In particular, it requires us to consider variable length network codes and relate the upper bound to equivalently lower bounding the entropy of descriptions observed by the terminal conditioned on the function value. This lower bound is obtained by further lower bounding the entropy of a so-called \textit{clumpy distribution}. We also demonstrate an achievable scheme that uses variable length network codes and in-network compression. △ Less

Submitted 26 January, 2016; originally announced January 2016.

Comments: Full paper for ISIT 2016 submission

arXiv:1510.07439 [pdf, ps, other]

Object Oriented Analysis using Natural Language Processing concepts: A Review

Authors: Abinash Tripathy, Santanu Kumar Rath

Abstract: The Software Development Life Cycle (SDLC) starts with eliciting requirements of the customers in the form of Software Requirement Specification (SRS). SRS document needed for software development is mostly written in Natural Language(NL) convenient for the client. From the SRS document only, the class name, its attributes and the functions incorporated in the body of the class are traced based on… ▽ More The Software Development Life Cycle (SDLC) starts with eliciting requirements of the customers in the form of Software Requirement Specification (SRS). SRS document needed for software development is mostly written in Natural Language(NL) convenient for the client. From the SRS document only, the class name, its attributes and the functions incorporated in the body of the class are traced based on pre-knowledge of analyst. The paper intends to present a review on Object Oriented (OO) analysis using Natural Language Processing (NLP) techniques. This analysis can be manual where domain expert helps to generate the required diagram or automated system, where the system generates the required diagram, from the input in the form of SRS. △ Less

Submitted 26 October, 2015; originally announced October 2015.

Comments: 12 pages, International Journal of Information Processing, 9(3), 38-50, 2015, ISSN : 0973-8215, IK International Publishing House Pvt. Ltd., New Delhi, India

Journal ref: International Journal of Information Processing, vol. 9, no. 3, pp. 38-50, 2015

arXiv:1504.05618 [pdf, other]

Capacity of Sum-networks for Different Message Alphabets

Authors: Ardhendu Tripathy, Aditya Ramamoorthy

Abstract: A sum-network is a directed acyclic network in which all terminal nodes demand the `sum' of the independent information observed at the source nodes. Many characteristics of the well-studied multiple-unicast network communication problem also hold for sum-networks due to a known reduction between instances of these two problems. Our main result is that unlike a multiple unicast network, the coding… ▽ More A sum-network is a directed acyclic network in which all terminal nodes demand the `sum' of the independent information observed at the source nodes. Many characteristics of the well-studied multiple-unicast network communication problem also hold for sum-networks due to a known reduction between instances of these two problems. Our main result is that unlike a multiple unicast network, the coding capacity of a sum-network is dependent on the message alphabet. We demonstrate this using a construction procedure and show that the choice of a message alphabet can reduce the coding capacity of a sum-network from $1$ to close to $0$. △ Less

Submitted 21 April, 2015; originally announced April 2015.

Showing 1–22 of 22 results for author: Tripathy, A