Search | arXiv e-print repository

arXiv:2005.09800 [pdf, other]

Fingerprinting Encrypted Voice Traffic on Smart Speakers with Deep Learning

Authors: Chenggang Wang, Sean Kennedy, Haipeng Li, King Hudson, Gowtham Atluri, Xuetao Wei, Wenhai Sun, Boyang Wang

Abstract: This paper investigates the privacy leakage of smart speakers under an encrypted traffic analysis attack, referred to as voice command fingerprinting. In this attack, an adversary can eavesdrop both outgoing and incoming encrypted voice traffic of a smart speaker, and infers which voice command a user says over encrypted traffic. We first built an automatic voice traffic collection tool and collec… ▽ More This paper investigates the privacy leakage of smart speakers under an encrypted traffic analysis attack, referred to as voice command fingerprinting. In this attack, an adversary can eavesdrop both outgoing and incoming encrypted voice traffic of a smart speaker, and infers which voice command a user says over encrypted traffic. We first built an automatic voice traffic collection tool and collected two large-scale datasets on two smart speakers, Amazon Echo and Google Home. Then, we implemented proof-of-concept attacks by leveraging deep learning. Our experimental results over the two datasets indicate disturbing privacy concerns. Specifically, compared to 1% accuracy with random guess, our attacks can correctly infer voice commands over encrypted traffic with 92.89\% accuracy on Amazon Echo. Despite variances that human voices may cause on outgoing traffic, our proof-of-concept attacks remain effective even only leveraging incoming traffic (i.e., the traffic from the server). This is because the AI-based voice services running on the server side response commands in the same voice and with a deterministic or predictable manner in text, which leaves distinguishable pattern over encrypted traffic. We also built a proof-of-concept defense to obfuscate encrypted traffic. Our results show that the defense can effectively mitigate attack accuracy on Amazon Echo to 32.18%. △ Less

Submitted 19 May, 2020; originally announced May 2020.

Journal ref: 13th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec '20), July 8--10, 2020, Linz (Virtual Event), Austria

arXiv:1810.02950 [pdf, ps, other]

doi 10.1109/TKDE.2019.2911681

Mining Novel Multivariate Relationships in Time Series Data Using Correlation Networks

Authors: Saurabh Agrawal, Michael Steinbach, Daniel Boley, Snigdhansu Chatterjee, Gowtham Atluri, Anh The Dang, Stefan Liess, Vipin Kumar

Abstract: In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the require… ▽ More In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets and lead to novel domain insights. △ Less

Submitted 23 April, 2019; v1 submitted 6 October, 2018; originally announced October 2018.

Comments: This is the accepted version of article submitted to IEEE Transactions on Knowledge and Data Engineering 2019

arXiv:1805.05927 [pdf]

CLINIQA: A Machine Intelligence Based Clinical Question Answering System

Authors: M A H Zahid, Ankush Mittal, R. C. Joshi, G. Atluri

Abstract: The recent developments in the field of biomedicine have made large volumes of biomedical literature available to the medical practitioners. Due to the large size and lack of efficient searching strategies, medical practitioners struggle to obtain necessary information available in the biomedical literature. Moreover, the most sophisticated search engines of age are not intelligent enough to inter… ▽ More The recent developments in the field of biomedicine have made large volumes of biomedical literature available to the medical practitioners. Due to the large size and lack of efficient searching strategies, medical practitioners struggle to obtain necessary information available in the biomedical literature. Moreover, the most sophisticated search engines of age are not intelligent enough to interpret the clinicians' questions. These facts reflect the urgent need of an information retrieval system that accepts the queries from medical practitioners' in natural language and returns the answers quickly and efficiently. In this paper, we present an implementation of a machine intelligence based CLINIcal Question Answering system (CLINIQA) to answer medical practitioner's questions. The system was rigorously evaluated on different text mining algorithms and the best components for the system were selected. The system makes use of Unified Medical Language System for semantic analysis of both questions and medical documents. In addition, the system employs supervised machine learning algorithms for classification of the documents, identifying the focus of the question and answer selection. Effective domain-specific heuristics are designed for answer ranking. The performance evaluation on hundred clinical questions shows the effectiveness of our approach. △ Less

Submitted 15 May, 2018; originally announced May 2018.

Comments: This manuscript was submitted to IEEE Transactions on Information Technology in Biomedicine in 2007 and was in second revision when it was withdrawn. As I moved to industry and could not get enough time to revise it. I am uploading it here for anyone interested in conventional ML based approach to NLP

MSC Class: 68T50

arXiv:1802.06095 [pdf, ps, other]

Mining Sub-Interval Relationships In Time Series Data

Authors: Saurabh Agrawal, Saurabh Verma, Gowtham Atluri, Anuj Karpatne, Stefan Liess, Angus Macdonald III, Snigdhansu Chatterjee, Vipin Kumar

Abstract: Time-series data is being increasingly collected and stud- ied in several areas such as neuroscience, climate science, transportation, and social media. Discovery of complex patterns of relationships between individual time-series, using data-driven approaches can improve our understanding of real-world systems. While traditional approaches typically study relationships between two entire time ser… ▽ More Time-series data is being increasingly collected and stud- ied in several areas such as neuroscience, climate science, transportation, and social media. Discovery of complex patterns of relationships between individual time-series, using data-driven approaches can improve our understanding of real-world systems. While traditional approaches typically study relationships between two entire time series, many interesting relationships in real-world applications exist in small sub-intervals of time while remaining absent or feeble during other sub-intervals. In this paper, we define the notion of a sub-interval relationship (SIR) to capture inter- actions between two time series that are prominent only in certain sub-intervals of time. We propose a novel and efficient approach to find most interesting SIR in a pair of time series. We evaluate our proposed approach on two real-world datasets from climate science and neuroscience domain and demonstrated the scalability and computational efficiency of our proposed approach. We further evaluated our discovered SIRs based on a randomization based procedure. Our results indicated the existence of several such relationships that are statistically significant, some of which were also found to have physical interpretation. △ Less

Submitted 16 February, 2018; originally announced February 2018.

arXiv:1711.04710 [pdf, other]

Spatio-Temporal Data Mining: A Survey of Problems and Methods

Authors: Gowtham Atluri, Anuj Karpatne, Vipin Kumar

Abstract: Large volumes of spatio-temporal data are increasingly collected and studied in diverse domains including, climate science, social sciences, neuroscience, epidemiology, transportation, mobile health, and Earth sciences. Spatio-temporal data differs from relational data for which computational approaches are developed in the data mining community for multiple decades, in that both spatial and tempo… ▽ More Large volumes of spatio-temporal data are increasingly collected and studied in diverse domains including, climate science, social sciences, neuroscience, epidemiology, transportation, mobile health, and Earth sciences. Spatio-temporal data differs from relational data for which computational approaches are developed in the data mining community for multiple decades, in that both spatial and temporal attributes are available in addition to the actual measurements/attributes. The presence of these attributes introduces additional challenges that needs to be dealt with. Approaches for mining spatio-temporal data have been studied for over a decade in the data mining community. In this article we present a broad survey of this relatively young field of spatio-temporal data mining. We discuss different types of spatio-temporal data and the relevant data mining questions that arise in the context of analyzing each of these datasets. Based on the nature of the data mining problem studied, we classify literature on spatio-temporal data mining into six major categories: clustering, predictive learning, change detection, frequent pattern mining, anomaly detection, and relationship mining. We discuss the various forms of spatio-temporal data mining problems in each of these categories. △ Less

Submitted 17 November, 2017; v1 submitted 13 November, 2017; originally announced November 2017.

Comments: Accepted for publication at ACM Computing Surveys

arXiv:1612.08544 [pdf, other]

doi 10.1109/TKDE.2017.2720168

Theory-guided Data Science: A New Paradigm for Scientific Discovery from Data

Authors: Anuj Karpatne, Gowtham Atluri, James Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, Vipin Kumar

Abstract: Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TG… ▽ More Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science. △ Less

Submitted 13 November, 2017; v1 submitted 27 December, 2016; originally announced December 2016.

Journal ref: IEEE Transactions on Knowledge and Data Engineering, 29(10), pp.2318-2331. 2017

arXiv:1210.6912 [pdf, ps, other]

Enhancing the functional content of protein interaction networks

Authors: Gaurav Pandey, Sahil Manocha, Gowtham Atluri, Vipin Kumar

Abstract: Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we explore the use of the concept of common neighborhood similarity (CNS), which is a form of… ▽ More Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we explore the use of the concept of common neighborhood similarity (CNS), which is a form of local structure in networks, to address these issues. Although several CNS measures have been proposed in the literature, an understanding of their relative efficacies for the analysis of interaction networks has been lacking. We follow the framework of graph transformation to convert the given interaction network into a transformed network corresponding to a variety of CNS measures evaluated. The effectiveness of each measure is then estimated by comparing the quality of protein function predictions obtained from its corresponding transformed network with those from the original network. Using a large set of S. cerevisiae interactions, and a set of 136 GO terms, we find that several of the transformed networks produce more accurate predictions than those obtained from the original network. In particular, the $HC.cont$ measure proposed here performs particularly well for this task. Further investigation reveals that the two major factors contributing to this improvement are the abilities of CNS measures, especially $HC.cont$, to prune out noisy edges and introduce new links between functionally related proteins. △ Less

Submitted 25 October, 2012; originally announced October 2012.

Showing 1–7 of 7 results for author: Atluri, G