-
Photoinduced modulation of refractive index in Langmuir-Blodgett films of Azo-based H-shaped liquid crystal molecules
Authors:
Ashutosh Joshi,
Akash Gayakwad,
Manjuladevi V.,
Mahesh C. Varia,
S. Kumar,
R. K. Gupta
Abstract:
The development of optically active areas consisting of organic molecules is essential for devices such as optical switches and waveguides, as such areas can be easily manipulated by applying suitable electromagnetic (EM) waves. In this article, we report the development of a photoactive surface by the deposition of a single layer of Langmuir-Blodgett (LB) film of a novel H-shaped liquid crystal (HLC) molecule. The synthesized HLC molecules possess azo-groups and nitro-groups. The azo-groups can be isomerized (trans-cis transformation) by irradiating them with ultraviolet (UV) light. The nitro-groups provide sufficient amphiphilicity for the HLC molecules to form a stable Langmuir monolayer at the air-water interface. The Langmuir monolayer of the HLC molecules exhibited gas and liquid-like phases. A single layer of LB film of HLC molecules was deposited on a gold chip of a home-built surface plasmon resonance (SPR) instrument. The azo-groups of the molecules in the LB film were excited by UV irradiation, leading to a change in morphology due to the trans-cis transformation. Such a change in morphology can lead to a minuscule change in the refractive index (RI) of the LB film. SPR is a label-free and highly sensitive optical phenomenon for measuring such changes in RI. In our studies, we found systematic changes in the resonance angle of the LB film of HLC molecules as a function of the intensity of the UV irradiation. We measured switch-on and switch-off intensities, which may suggest that the LB film of HLC molecules can find applications in optical switches or waveguides.
Submitted 13 August, 2022;
originally announced August 2022.
-
Can the Government Compel Decryption? Don't Trust -- Verify
Authors:
Aloni Cohen,
Sarah Scheffler,
Mayank Varia
Abstract:
If a court knows that a respondent knows the password to a device, can the court compel the respondent to enter that password into the device? In this work, we propose a new approach to the foregone conclusion doctrine from Fisher v US that governs the answer to this question. The Holy Grail of this line of work would be a framework for reasoning about whether the testimony implicit in any action is already known to the government. In this paper we attempt something narrower. We introduce a framework for specifying actions for which all implicit testimony is, constructively, a foregone conclusion. Our approach is centered around placing the burden of proof on the government to demonstrate that it is not "rely[ing] on the truthtelling" of the respondent.
Building on original legal analysis and using precise computer science formalisms, we propose demonstrability as a new central concept for describing compelled acts. We additionally provide a language for whether a compelled action meaningfully entails the respondent to perform in a manner that is 'as good as' the government's desired goal. Then, we apply our definitions to analyze the compellability of several cryptographic primitives including decryption, multifactor authentication, commitment schemes, and hash functions. In particular, our framework reaches a novel conclusion about compelled decryption in the setting that the encryption scheme is deniable: the government can compel but the respondent is free to use any password of her choice.
Submitted 9 September, 2022; v1 submitted 4 August, 2022;
originally announced August 2022.
-
Formalizing Human Ingenuity: A Quantitative Framework for Copyright Law's Substantial Similarity
Authors:
Sarah Scheffler,
Eran Tromer,
Mayank Varia
Abstract:
A central notion in U.S. copyright law is judging the substantial similarity between an original and an (allegedly) derived work. Capturing this notion has proven elusive, and the many approaches offered by case law and legal scholarship are often ill-defined, contradictory, or internally-inconsistent.
This work suggests that key parts of the substantial-similarity puzzle are amenable to modeling inspired by theoretical computer science. Our proposed framework quantitatively evaluates how much "novelty" is needed to produce the derived work with access to the original work, versus reproducing it without access to the copyrighted elements of the original work. "Novelty" is captured by a computational notion of description length, in the spirit of Kolmogorov-Levin complexity, which is robust to mechanical transformations and availability of contextual information.
This results in an actionable framework that could be used by courts as an aid for deciding substantial similarity. We evaluate it on several pivotal cases in copyright law and observe that the results are consistent with the rulings, and are philosophically aligned with the abstraction-filtration-comparison test of Altai.
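As a rough illustration of the description-length idea in this abstract, the sketch below uses a general-purpose compressor as a crude stand-in for Kolmogorov-style complexity. This is an assumption made for illustration only: the paper's framework is not based on zlib, and the helper names `description_length` and `conditional_length` are hypothetical.

```python
import random
import zlib


def description_length(data: bytes) -> int:
    """Compressed size in bytes, used as a crude proxy for description length."""
    return len(zlib.compress(data, 9))


def conditional_length(derived: bytes, original: bytes) -> int:
    """Approximate extra bytes needed to describe `derived` given `original`.

    Hypothetical proxy: compress the concatenation and subtract the cost of
    the original alone. Not the measure defined in the paper.
    """
    return description_length(original + derived) - description_length(original)


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.randrange(256) for _ in range(2000))
    verbatim_copy = original
    unrelated = bytes(rng.randrange(256) for _ in range(2000))
    # A verbatim copy should be far cheaper to describe with access to the
    # original than an unrelated work of the same size.
    print("copy:", conditional_length(verbatim_copy, original))
    print("unrelated:", conditional_length(unrelated, original))
```

Under this proxy, a small conditional length suggests the derived work adds little beyond the original. A real compressor only bounds description length from above, so the numbers are indicative at best.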
Submitted 14 June, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
-
Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework
Authors:
Rushi Patel,
Pouya Haghi,
Shweta Jain,
Andriy Kot,
Venkata Krishnan,
Mayank Varia,
Martin Herbordt
Abstract:
Performance of distributed data center applications can be improved through use of FPGA-based SmartNICs, which provide additional functionality and enable higher bandwidth communication. Until lately, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography -- secure Multi-Party Computation (MPC) -- that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that using the COPA framework gives significant improvements to both computation and communication as compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA enables more than 17Gbps of communication bandwidth while using only 1% of Stratix 10 resources. We show that utilizing the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100Gbps link enabling higher performance compared to traditional NICs.
Submitted 10 April, 2022;
originally announced April 2022.
-
Secrecy: Secure collaborative analytics on secret-shared data
Authors:
John Liagouris,
Vasiliki Kalavri,
Muhammad Faisal,
Mayank Varia
Abstract:
We present a relational MPC framework for secure collaborative analytics on private data with no information leakage. Our work targets challenging use cases where data owners may not have private resources to participate in the computation, thus, they need to securely outsource the data analysis to untrusted third parties. We define a set of oblivious operators, explain the secure primitives they rely on, and analyze their costs in terms of operations and inter-party communication. We show how these operators can be composed to form end-to-end oblivious queries, and we introduce logical and physical optimizations that dramatically reduce the space and communication requirements during query execution, in some cases from quadratic to linear or from linear to logarithmic with respect to the cardinality of the input.
We implement our framework on top of replicated secret sharing in a system called Secrecy and evaluate it using real queries from several MPC application areas. Our experiments demonstrate that the proposed optimizations can result in over 1000x lower execution times compared to baseline approaches, enabling Secrecy to outperform state-of-the-art frameworks and compute MPC queries on millions of input rows with a single thread per party.
Submitted 3 February, 2022; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Secret Sharing MPC on FPGAs in the Datacenter
Authors:
Pierre-Francois Wolfe,
Rushi Patel,
Robert Munafo,
Mayank Varia,
Martin Herbordt
Abstract:
Multi-Party Computation (MPC) is a technique enabling data from several sources to be used in a secure computation revealing only the result while protecting the original data, facilitating shared utilization of data sets gathered by different entities. The presence of Field Programmable Gate Array (FPGA) hardware in datacenters can provide accelerated computing as well as low latency, high bandwidth communication that bolsters the performance of MPC and lowers the barrier to using MPC for many applications. In this work, we propose a Secret Sharing FPGA design based on the protocol described by Araki et al. We compare our hardware design to the original authors' software implementations of Secret Sharing and to work accelerating MPC protocols based on Garbled Circuits with FPGAs. Our conclusion is that Secret Sharing in the datacenter is competitive and when implemented on FPGA hardware was able to use at least 10$\times$ fewer computer resources than the original work using CPUs.
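To make the secret-sharing idea concrete, here is a minimal sketch of three-party replicated additive secret sharing over 32-bit integers. It is a generic textbook construction under assumed parameters (the modulus `MOD` and the function names are illustrative choices), not the exact protocol of Araki et al. or the FPGA design described in the abstract.

```python
import secrets

MOD = 2**32  # assumed ring size, chosen only for illustration


def share(x: int):
    """Split x into three additive shares; party i holds shares i and i+1."""
    r1 = secrets.randbelow(MOD)
    r2 = secrets.randbelow(MOD)
    r3 = (x - r1 - r2) % MOD
    return [(r1, r2), (r2, r3), (r3, r1)]


def add_shares(a, b):
    """Each party adds its own share pairs locally; no communication needed."""
    return [((a0 + b0) % MOD, (a1 + b1) % MOD)
            for (a0, a1), (b0, b1) in zip(a, b)]


def reconstruct(party_i, party_next):
    """Recover the secret from party i and party i+1 (indices mod 3)."""
    return (party_i[0] + party_i[1] + party_next[1]) % MOD


if __name__ == "__main__":
    x_shares = share(123)
    y_shares = share(456)
    z_shares = add_shares(x_shares, y_shares)
    print(reconstruct(z_shares[0], z_shares[1]))
```

Any single party sees only two uniformly random values, while any two consecutive parties together hold all three shares. Addition is local. Multiplication, which requires interaction between parties, is omitted from this sketch.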
Submitted 1 July, 2020;
originally announced July 2020.
-
Anonymous Collocation Discovery: Harnessing Privacy to Tame the Coronavirus
Authors:
Ran Canetti,
Ari Trachtenberg,
Mayank Varia
Abstract:
Successful containment of the Coronavirus pandemic rests on the ability to quickly and reliably identify those who have been in close proximity to a contagious individual. Existing tools for doing so rely on the collection of exact location information of individuals over lengthy time periods, and combining this information with other personal information. This unprecedented encroachment on individual privacy at national scales has created an outcry and risks rejection of these tools.
We propose an alternative: an extremely simple scheme for providing fine-grained and timely alerts to users who have been in the close vicinity of an infected individual. Crucially, this is done while preserving the anonymity of all individuals, and without collecting or storing any personal information or location history. Our approach is based on using short-range communication mechanisms, like Bluetooth, that are available in all modern cell phones. It can be deployed with very little infrastructure, and incurs a relatively low false-positive rate compared to other collocation methods. We also describe a number of extensions and tradeoffs.
We believe that the privacy guarantees provided by the scheme will encourage quick and broad voluntary adoption. When combined with sufficient testing capacity and existing best practices from healthcare professionals, we hope that this may significantly reduce the infection rate.
Submitted 3 April, 2020; v1 submitted 30 March, 2020;
originally announced March 2020.
-
Case Study: Disclosure of Indirect Device Fingerprinting in Privacy Policies
Authors:
Julissa Milligan,
Sarah Scheffler,
Andrew Sellars,
Trishita Tiwari,
Ari Trachtenberg,
Mayank Varia
Abstract:
Recent developments in online tracking make it harder for individuals to detect and block trackers. Some sites have deployed indirect tracking methods, which attempt to uniquely identify a device by asking the browser to perform a seemingly-unrelated task. One type of indirect tracking, Canvas fingerprinting, causes the browser to render a graphic recording rendering statistics as a unique identifier. In this work, we observe how indirect device fingerprinting methods are disclosed in privacy policies, and consider whether the disclosures are sufficient to enable website visitors to block the tracking methods. We compare these disclosures to the disclosure of direct fingerprinting methods on the same websites.
Our case study analyzes one indirect fingerprinting technique, Canvas fingerprinting. We use an existing automated detector of this fingerprinting technique to conservatively detect its use on Alexa Top 500 websites that cater to United States consumers, and we examine the privacy policies of the resulting 28 websites. Disclosures of indirect fingerprinting vary in specificity. None described the specific methods with enough granularity to know the website used Canvas fingerprinting. Conversely, many sites did provide enough detail about usage of direct fingerprinting methods to allow a website visitor to reliably detect and block those techniques.
We conclude that indirect fingerprinting methods are often difficult to detect and are not identified with specificity in privacy policies. This makes indirect fingerprinting more difficult to block, and therefore risks disturbing the tentative armistice between individuals and websites currently in place for direct fingerprinting. This paper illustrates differences in fingerprinting approaches, and explains why technologists, technology lawyers, and policymakers need to appreciate the challenges of indirect fingerprinting.
Submitted 21 August, 2019;
originally announced August 2019.
-
Conclave: secure multi-party computation on big data (extended TR)
Authors:
Nikolaj Volgushev,
Malte Schwarzkopf,
Ben Getchell,
Mayank Varia,
Andrei Lapets,
Azer Bestavros
Abstract:
Secure Multi-Party Computation (MPC) allows mutually distrusting parties to run joint computations without revealing private data. Current MPC algorithms scale poorly with data size, which makes MPC on "big data" prohibitively slow and inhibits its practical use.
Many relational analytics queries can maintain MPC's end-to-end security guarantee without using cryptographic MPC techniques for all operations. Conclave is a query compiler that accelerates such queries by transforming them into a combination of data-parallel, local cleartext processing and small MPC steps. When parties trust others with specific subsets of the data, Conclave applies new hybrid MPC-cleartext protocols to run additional steps outside of MPC and improve scalability further.
Our Conclave prototype generates code for cleartext processing in Python and Spark, and for secure MPC using the Sharemind and Obliv-C frameworks. Conclave scales to data sets between three and six orders of magnitude larger than state-of-the-art MPC frameworks support on their own. Thanks to its hybrid protocols, Conclave also substantially outperforms SMCQL, the most similar existing system.
Submitted 17 February, 2019;
originally announced February 2019.
-
Revealing the Unseen: How to Expose Cloud Usage While Protecting User Privacy
Authors:
Ata Turk,
Mayank Varia,
Georgios Kellaris
Abstract:
Cloud users have little visibility into the performance characteristics and utilization of the physical machines underpinning the virtualized cloud resources they use. This uncertainty forces users and researchers to reverse engineer the inner workings of cloud systems in order to understand and optimize the conditions under which their applications operate. At Massachusetts Open Cloud (MOC), as a public cloud operator, we would like to expose the utilization of our physical infrastructure to stop this wasteful effort. Mindful that such exposure can be used maliciously to gain insight into other users' workloads, in this position paper we argue for an approach that balances openness of the cloud overall with privacy for each tenant inside of it. We believe that this approach can be instantiated via a novel combination of several security and privacy technologies. We discuss the potential benefits, the implications of transparency for cloud systems and users, and the technical challenges and possibilities.
Submitted 2 October, 2017;
originally announced October 2017.
-
Privacy with Estimation Guarantees
Authors:
Hao Wang,
Lisa Vo,
Flavio P. Calmon,
Muriel Médard,
Ken R. Duffy,
Mayank Varia
Abstract:
We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user that owns the data. In this setting, we present an estimation-theoretic analysis of the privacy-utility trade-off (PUT). Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how chi-square information captures the fundamental PUT in this case and provide bounds for the best PUT. We propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori and the data distribution is known. We derive lower bounds on the minimum mean-squared error of estimating a target function from the disclosed data and evaluate the robustness of our approach when an empirical distribution is used to compute the privacy-assuring mappings instead of the true data distribution. We illustrate the proposed approach through two numerical experiments.
Submitted 20 March, 2020; v1 submitted 1 October, 2017;
originally announced October 2017.
-
Principal Inertia Components and Applications
Authors:
Flavio P. Calmon,
Ali Makhdoumi,
Muriel Médard,
Mayank Varia,
Mark Christiansen,
Ken R. Duffy
Abstract:
We explore properties and applications of the Principal Inertia Components (PICs) between two discrete random variables $X$ and $Y$. The PICs lie in the intersection of information and estimation theory, and provide a fine-grained decomposition of the dependence between $X$ and $Y$. Moreover, the PICs describe which functions of $X$ can or cannot be reliably inferred (in terms of MMSE) given an observation of $Y$. We demonstrate that the PICs play an important role in information theory, and they can be used to characterize information-theoretic limits of certain estimation problems. In privacy settings, we prove that the PICs are related to fundamental limits of perfect privacy.
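A small worked example may help fix ideas. The PICs of a joint distribution are the squared singular values, excluding the trivial one equal to 1, of the matrix with entries $p(x,y)/\sqrt{p(x)p(y)}$, so their sum equals the chi-square information $\sum_{x,y} p(x,y)^2/(p(x)p(y)) - 1$. The sketch below computes that sum in plain Python. For a pair of binary variables there is only one PIC, so the sum is the PIC itself. The function names and the binary symmetric channel example are illustrative choices, not taken from the paper.

```python
def chi_square_information(joint):
    """Sum of the PICs of a joint pmf given as a list of rows p(x, y)."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                total += p * p / (px[i] * py[j])
    return total - 1.0


def bsc_joint(delta):
    """Joint pmf of a uniform bit X sent through a binary symmetric channel."""
    return [[0.5 * (1 - delta), 0.5 * delta],
            [0.5 * delta, 0.5 * (1 - delta)]]


if __name__ == "__main__":
    # For this channel the single PIC equals (1 - 2*delta)**2.
    print(chi_square_information(bsc_joint(0.1)))
```

The value is 0 when $X$ and $Y$ are independent and grows as the dependence between them strengthens.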
Submitted 3 April, 2017;
originally announced April 2017.
-
SoK: Cryptographically Protected Database Search
Authors:
Benjamin Fuller,
Mayank Varia,
Arkady Yerukhimovich,
Emily Shen,
Ariel Hamlin,
Vijay Gadepally,
Richard Shay,
John Darby Mitchell,
Robert K. Cunningham
Abstract:
Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies.
However, there is no best protected search system or set of techniques. Design of such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases.
At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions:
1) An identification of the important primitive operations across database paradigms. We find there are a small number of base operations that can be used and combined to support a large number of database paradigms.
2) An evaluation of the current state of protected search systems in implementing these base operations. This evaluation describes the main approaches and tradeoffs for each base operation. Furthermore, it puts protected search in the context of unprotected search, identifying key gaps in functionality.
3) An analysis of attacks against protected search for different base queries.
4) A roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search.
Submitted 2 June, 2017; v1 submitted 6 March, 2017;
originally announced March 2017.
-
Parallel Vectorized Algebraic AES in MATLAB for Rapid Prototyping of Encrypted Sensor Processing Algorithms and Database Analytics
Authors:
Jeremy Kepner,
Vijay Gadepally,
Braden Hancock,
Peter Michaleas,
Elizabeth Michel,
Mayank Varia
Abstract:
The increasing use of networked sensor systems and networked databases has led to an increased interest in incorporating encryption directly into sensor algorithms and database analytics. MATLAB is the dominant tool for rapid prototyping of sensor algorithms and has extensive database analytics capabilities. The advent of high level and high performance Galois Field mathematical environments allows encryption algorithms to be expressed succinctly and efficiently. This work leverages the Galois Field primitives found in the MATLAB Communication Toolbox to implement a mode of the Advanced Encryption Standard (AES) based on first principles mathematics. The resulting implementation requires 100x less code than standard AES implementations and delivers speed that is effective for many design purposes. The parallel version achieves speed comparable to native OpenSSL on a single node and is sufficient for real-time prototyping of many sensor processing algorithms and database analytics.
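The algebraic view of AES mentioned here rests on arithmetic in the Galois field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). The following is a minimal Python sketch of that field multiplication, written for illustration. It is not the MATLAB implementation described in the abstract, and the function name `gf_mul` is an assumption.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes as elements of GF(2^8) using shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY    # reduce modulo the AES polynomial
        b >>= 1
    return result


if __name__ == "__main__":
    # Worked example from the AES specification: {57} * {83} = {c1}.
    print(hex(gf_mul(0x57, 0x83)))
```

The AES S-box and MixColumns steps are both built from this operation, which is why a Galois Field toolbox allows the cipher to be written compactly.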
Submitted 29 June, 2015;
originally announced June 2015.
-
Computing on Masked Data to improve the Security of Big Data
Authors:
Vijay Gadepally,
Braden Hancock,
Benjamin Kaiser,
Jeremy Kepner,
Pete Michaleas,
Mayank Varia,
Arkady Yerukhimovich
Abstract:
Organizations that make use of large quantities of information require the ability to store and process data from central locations so that the product can be shared or distributed across a heterogeneous group of users. However, recent events underscore the need for improving the security of data stored in such untrusted servers or databases. Advances in cryptographic techniques and database technologies provide the necessary security functionality but rely on a computational model in which the cloud is used solely for storage and retrieval. Much of big data computation and analytics make use of signal processing fundamentals for computation. As the trend of moving data storage and computation to the cloud increases, homeland security missions should understand the impact of security on key signal processing kernels such as correlation or thresholding. In this article, we propose a tool called Computing on Masked Data (CMD), which combines advances in database technologies and cryptographic tools to provide a low overhead mechanism to offload certain mathematical operations securely to the cloud. This article describes the design and development of the CMD tool.
Submitted 6 April, 2015;
originally announced April 2015.
-
Hiding Symbols and Functions: New Metrics and Constructions for Information-Theoretic Security
Authors:
Flavio du Pin Calmon,
Muriel Médard,
Mayank Varia,
Ken R. Duffy,
Mark M. Christiansen,
Linda M. Zeger
Abstract:
We present information-theoretic definitions and results for analyzing symmetric-key encryption schemes beyond the perfect secrecy regime, i.e. when perfect secrecy is not attained. We adopt two lines of analysis, one based on lossless source coding, and another akin to rate-distortion theory. We start by presenting a new information-theoretic metric for security, called symbol secrecy, and derive associated fundamental bounds. We then introduce list-source codes (LSCs), which are a general framework for mapping a key length (entropy) to a list size that an eavesdropper has to resolve in order to recover a secret message. We provide explicit constructions of LSCs, and demonstrate that, when the source is uniformly distributed, the highest level of symbol secrecy for a fixed key length can be achieved through a construction based on minimum-distance separable (MDS) codes. Using an analysis related to rate-distortion theory, we then show how symbol secrecy can be used to determine the probability that an eavesdropper correctly reconstructs functions of the original plaintext. We illustrate how these bounds can be applied to characterize security properties of symmetric-key encryption schemes, and, in particular, extend security claims based on symbol secrecy to a functional setting.
Submitted 29 March, 2015;
originally announced March 2015.
-
Computing on Masked Data: a High Performance Method for Improving Big Data Veracity
Authors:
Jeremy Kepner,
Vijay Gadepally,
Pete Michaleas,
Nabil Schear,
Mayank Varia,
Arkady Yerukhimovich,
Robert K. Cunningham
Abstract:
The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Along with these standard three V's of big data, an emerging fourth "V" is veracity, which addresses the confidentiality, integrity, and availability of the data. Traditional cryptographic techniques that ensure the veracity of data can have overheads that are too large to apply to big data. This work introduces a new technique called Computing on Masked Data (CMD), which improves data veracity by allowing computations to be performed directly on masked data and ensuring that only authorized recipients can unmask the data. Using the sparse linear algebra of associative arrays, CMD can be performed with significantly less overhead than other approaches while still supporting a wide range of linear algebraic operations on the masked data. Databases with strong support of sparse operations, such as SciDB or Apache Accumulo, are ideally suited to this technique. Examples are shown for the application of CMD to a complex DNA matching algorithm and to database operations over social media data.
Submitted 22 June, 2014;
originally announced June 2014.
-
An Exploration of the Role of Principal Inertia Components in Information Theory
Authors:
Flavio du Pin Calmon,
Mayank Varia,
Muriel Médard
Abstract:
The principal inertia components of the joint distribution of two random variables $X$ and $Y$ are inherently connected to how an observation of $Y$ is statistically related to a hidden variable $X$. In this paper, we explore this connection within an information theoretic framework. We show that, under certain symmetry conditions, the principal inertia components play an important role in estimating one-bit functions of $X$, namely $f(X)$, given an observation of $Y$. In particular, the principal inertia components bear an interpretation as filter coefficients in the linear transformation of $p_{f(X)|X}$ into $p_{f(X)|Y}$. This interpretation naturally leads to the conjecture that the mutual information between $f(X)$ and $Y$ is maximized when all the principal inertia components have equal value. We also study the role of the principal inertia components in the Markov chain $B\rightarrow X\rightarrow Y\rightarrow \widehat{B}$, where $B$ and $\widehat{B}$ are binary random variables. We illustrate our results for the setting where $X$ and $Y$ are binary strings and $Y$ is the result of sending $X$ through an additive noise binary channel.
Submitted 6 May, 2014;
originally announced May 2014.
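As a rough illustration of the quantity discussed in the abstract above, the sketch below computes the sum of the principal inertias of a small joint distribution. It assumes the standard definition in which the principal inertias are the squared singular values of $D_X^{-1/2} P D_Y^{-1/2}$ with the trivial value 1 removed, so that their sum equals the squared Frobenius norm of that matrix minus 1. The function names `total_inertia` and `bsc_joint` are illustrative and do not come from the paper, and the example covers only a single bit sent through a binary symmetric channel rather than the binary-string setting of the abstract.

```python
def total_inertia(joint):
    """Sum of the principal inertias of a joint pmf.

    `joint` is a list of rows; joint[i][j] = P(X = i, Y = j).
    Assumes the definition via the singular values of
    D_X^{-1/2} P D_Y^{-1/2}; the sum of their squares is
    1 + (sum of principal inertias).
    """
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                total += p * p / (px[i] * py[j])
    return total - 1.0


def bsc_joint(delta):
    """Joint pmf of a uniform bit X and Y = X xor noise, P(noise = 1) = delta."""
    return [
        [0.5 * (1.0 - delta), 0.5 * delta],
        [0.5 * delta, 0.5 * (1.0 - delta)],
    ]


if __name__ == "__main__":
    for delta in (0.0, 0.1, 0.25, 0.5):
        print(delta, total_inertia(bsc_joint(delta)))
```

For binary $X$ and $Y$ there is only one nontrivial principal inertia, so the returned value is that component itself. With a uniform input and crossover probability $\delta$ it works out to $(1-2\delta)^2$, which is about 0.64 for $\delta = 0.1$ and 0 when the output is independent of the input.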
-
Bounds on inference
Authors:
Flavio du Pin Calmon,
Mayank Varia,
Muriel Médard,
Mark M. Christiansen,
Ken R. Duffy,
Stefano Tessaro
Abstract:
Lower bounds for the average probability of error of estimating a hidden variable X given an observation of a correlated random variable Y, and Fano's inequality in particular, play a central role in information theory. In this paper, we present a lower bound for the average estimation error based on the marginal distribution of X and the principal inertias of the joint distribution matrix of X and Y. Furthermore, we discuss an information measure based on the sum of the largest principal inertias, called k-correlation, which generalizes maximal correlation. We show that k-correlation satisfies the Data Processing Inequality and is convex in the conditional distribution of Y given X. Finally, we investigate how to answer a fundamental question in inference and privacy: given an observation Y, can we estimate a function f(X) of the hidden random variable X with an average error below a certain threshold? We provide a general method for answering this question using an approach based on rate-distortion theory.
Submitted 5 October, 2013;
originally announced October 2013.
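The abstract above takes Fano's inequality as its point of comparison. The sketch below is not the paper's new bound; it only evaluates the familiar weakened form of Fano's inequality, $P_e \ge (H(X|Y) - 1)/\log_2 |\mathcal{X}|$, next to the exact error of the maximum a posteriori (MAP) estimator for a small joint distribution. All function names are hypothetical, and the joint distribution is a made-up example in which $X$ is independent of $Y$.

```python
from math import log2


def conditional_entropy_bits(joint):
    """H(X|Y) in bits; joint[i][j] = P(X = i, Y = j)."""
    py = [sum(col) for col in zip(*joint)]
    h = 0.0
    for row in joint:
        for j, p in enumerate(row):
            if p > 0.0:
                h -= p * log2(p / py[j])
    return h


def weak_fano_bound(joint):
    """Weakened Fano lower bound on the error probability.

    Uses P_e >= (H(X|Y) - 1) / log2(|X|), clipped at 0.
    Requires at least two values of X so that log2(|X|) > 0.
    """
    m = len(joint)
    return max(0.0, (conditional_entropy_bits(joint) - 1.0) / log2(m))


def map_error(joint):
    """Exact error probability of the MAP estimate of X from Y."""
    return 1.0 - sum(max(col) for col in zip(*joint))


if __name__ == "__main__":
    # Hypothetical example: X uniform on 4 symbols, independent of a binary Y.
    joint = [[0.125, 0.125] for _ in range(4)]
    print("H(X|Y) bits:", conditional_entropy_bits(joint))
    print("weak Fano bound:", weak_fano_bound(joint))
    print("MAP error:", map_error(joint))
```

In this example $H(X|Y) = 2$ bits, so the weakened bound gives 0.5 while the MAP error is 0.75. The gap shows how loose this form of the bound can be, which is the kind of slack that sharper bounds try to reduce.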