-
Pseudorandom Error-Correcting Codes
Authors:
Miranda Christ,
Sam Gunn
Abstract:
We construct pseudorandom error-correcting codes (or simply pseudorandom codes), which are error-correcting codes with the property that any polynomial number of codewords are pseudorandom to any computationally-bounded adversary. Efficient decoding of corrupted codewords is possible with the help of a decoding key.
We build pseudorandom codes that are robust to substitution and deletion errors, where pseudorandomness rests on standard cryptographic assumptions. Specifically, pseudorandomness is based on either $2^{O(\sqrt{n})}$-hardness of LPN, or polynomial hardness of LPN and the planted XOR problem at low density.
As our primary application of pseudorandom codes, we present an undetectable watermarking scheme for outputs of language models that is robust to cropping and a constant rate of random substitutions and deletions. The watermark is undetectable in the sense that any number of samples of watermarked text are computationally indistinguishable from text output by the original model. This is the first undetectable watermarking scheme that can tolerate a constant rate of errors.
Our second application is to steganography, where a secret message is hidden in innocent-looking content. We present a constant-rate stateless steganography scheme with robustness to a constant rate of substitutions. Ours is the first stateless steganography scheme with provable steganographic security and any robustness to errors.
Submitted 17 June, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
The Impact of De-Identification on Single-Year-of-Age Counts in the U.S. Census
Authors:
Sarah Radway,
Miranda Christ
Abstract:
In 2020, the U.S. Census Bureau transitioned from data swapping to differential privacy (DP) in its approach to de-identifying decennial census data. This decision has faced considerable criticism from data users, particularly due to concerns about the accuracy of DP. We compare the relative impacts of swapping and DP on census data, focusing on the use case of school planning, where single-year-of-age population counts (e.g., the number of four-year-olds in the district) are used to estimate the number of incoming students and make resulting decisions surrounding faculty, classrooms, and funding requests. We examine these impacts for school districts of varying population sizes and age distributions.
Our findings support the use of DP over swapping for single-year-of-age counts; in particular, concerning behaviors associated with DP (namely, poor behavior for smaller districts) occur with swapping mechanisms as well. For the school planning use cases we investigate, DP provides comparable, if not improved, accuracy over swapping, while offering other benefits such as improved transparency.
Submitted 24 August, 2023;
originally announced August 2023.
-
Undetectable Watermarks for Language Models
Authors:
Miranda Christ,
Sam Gunn,
Or Zamir
Abstract:
Recent advances in the capabilities of large language models such as GPT-4 have spurred increasing concern about our ability to detect AI-generated text. Prior works have suggested methods of embedding watermarks in model outputs, by noticeably altering the output distribution. We ask: Is it possible to introduce a watermark without incurring any detectable change to the output distribution?
To this end we introduce a cryptographically-inspired notion of undetectable watermarks for language models. That is, watermarks can be detected only with the knowledge of a secret key; without the secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model. In particular, it is impossible for a user to observe any degradation in the quality of the text. Crucially, watermarks should remain undetectable even when the user is allowed to adaptively query the model with arbitrarily chosen prompts. We construct undetectable watermarks based on the existence of one-way functions, a standard assumption in cryptography.
Submitted 24 May, 2023;
originally announced June 2023.
-
The Smoothed Complexity of Policy Iteration for Markov Decision Processes
Authors:
Miranda Christ,
Mihalis Yannakakis
Abstract:
We show subexponential lower bounds (i.e., $2^{\Omega(n^c)}$) on the smoothed complexity of the classical Howard's Policy Iteration algorithm for Markov Decision Processes. The bounds hold for the total reward and the average reward criteria. The constructions are robust in the sense that the subexponential bound holds not only on the average for independent random perturbations of the MDP parameters (transition probabilities and rewards), but for all arbitrary perturbations within an inverse polynomial range. We also show an exponential lower bound on the worst-case complexity for the simple reachability objective.
Submitted 30 November, 2022;
originally announced December 2022.
-
Distributed and parallel time series feature extraction for industrial big data applications
Authors:
Maximilian Christ,
Andreas W. Kempa-Liehr,
Michael Feindt
Abstract:
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features at an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, can be applied to a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive, as well as on time series from a production line optimization project and on simulated stochastic processes with an underlying qualitative change of dynamics.
Submitted 19 May, 2017; v1 submitted 24 October, 2016;
originally announced October 2016.
-
Online Multi-Coloring with Advice
Authors:
Marie G. Christ,
Lene M. Favrholdt,
Kim S. Larsen
Abstract:
We consider the problem of online graph multi-coloring with advice. Multi-coloring is often used to model frequency allocation in cellular networks. We give several nearly tight upper and lower bounds for the most standard topologies of cellular networks, paths and hexagonal graphs. For the path, negative results trivially carry over to bipartite graphs, and our positive results are also valid for bipartite graphs. The advice given represents information that is likely to be available, for instance from studying the data from earlier, similar periods of time.
Submitted 5 September, 2014;
originally announced September 2014.
-
Online Bin Covering: Expectations vs. Guarantees
Authors:
Marie G. Christ,
Lene M. Favrholdt,
Kim S. Larsen
Abstract:
Bin covering is a dual version of classic bin packing. Thus, the goal is to cover as many bins as possible, where covering a bin means packing items of total size at least one in the bin.
For online bin covering, competitive analysis fails to distinguish between most algorithms of interest; all "reasonable" algorithms have a competitive ratio of 1/2. Thus, in order to get a better understanding of the combinatorial difficulties in solving this problem, we turn to other performance measures, namely relative worst order, random order, and max/max analysis, as well as analyzing input with restricted or uniformly distributed item sizes. In this way, our study also supplements the ongoing systematic studies of the relative strengths of various performance measures.
Two classic algorithms for online bin packing that have natural dual versions are Harmonic and Next-Fit. Even though the algorithms are quite different in nature, the dual versions are not separated by competitive analysis. We make the case that when guarantees are needed, even under restricted input sequences, dual Harmonic is preferable. In addition, we establish quite robust theoretical results showing that if items come from a uniform distribution or even if just the ordering of items is uniformly random, then dual Next-Fit is the right choice.
Submitted 27 February, 2014; v1 submitted 25 September, 2013;
originally announced September 2013.
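As a rough illustration of the covering objective described in the abstract above, the following is a minimal sketch of the dual Next-Fit rule as it is commonly stated: keep placing items in the current bin until its total size reaches at least one, then open a new bin. The function name `dual_next_fit` and the use of exact `Fraction` sizes are assumptions made for this example and are not taken from the paper.

```python
from fractions import Fraction


def dual_next_fit(items):
    """Count the bins covered by a dual Next-Fit style rule (illustrative sketch).

    Each item is placed in the single open bin. As soon as the open bin's
    total size reaches at least one, it counts as covered and a new, empty
    bin is opened. A final partially filled bin does not count.
    """
    covered = 0
    level = Fraction(0)  # total size currently in the open bin
    for size in items:
        level += Fraction(size)
        if level >= 1:
            covered += 1
            level = Fraction(0)
    return covered


# Hypothetical input: item sizes given as exact fractions.
sizes = [Fraction(3, 5), Fraction(3, 5), Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(dual_next_fit(sizes))
```

On this input the first two items cover one bin (total 6/5) and the remaining three cover a second (total exactly 1), so two bins are covered. The example only shows the rule itself; it does not reproduce any of the performance measures analyzed in the paper.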
-
Communication lower bounds and optimal algorithms for programs that reference arrays -- Part 1
Authors:
Michael Christ,
James Demmel,
Nicholas Knight,
Thomas Scanlon,
Katherine Yelick
Abstract:
The movement of data (communication) between levels of a memory hierarchy, or between parallel processors on a network, can greatly dominate the cost of computation, so algorithms that minimize communication are of interest. Motivated by this, attainable lower bounds for the amount of communication required by algorithms were established by several groups for a variety of algorithms, including matrix computations. Prior work of Ballard-Demmel-Holtz-Schwartz relied on a geometric inequality of Loomis and Whitney for this purpose. In this paper the general theory of discrete multilinear Hölder-Brascamp-Lieb (HBL) inequalities is used to establish communication lower bounds for a much wider class of algorithms. In some cases, algorithms are presented which attain these lower bounds.
Several contributions are made to the theory of HBL inequalities proper. The optimal constant in such an inequality for torsion-free Abelian groups is shown to equal one whenever it is finite. Bennett-Carbery-Christ-Tao had characterized the tuples of exponents for which such an inequality is valid as the convex polyhedron defined by a certain finite list of inequalities. The problem of constructing an algorithm to decide whether a given inequality is on this list is shown to be equivalent to Hilbert's Tenth Problem over the rationals, which remains open. Nonetheless, an algorithm which computes the polyhedron itself is constructed.
Submitted 31 July, 2013;
originally announced August 2013.