On the Relative Trust between Inconsistent Data and Inaccurate Constraints
Authors:
George Beskales,
Ihab F. Ilyas,
Lukasz Golab,
Artur Galiullin
Abstract:
Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency.
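As a rough illustration of the trade-off described in this abstract, the sketch below checks a toy relation against an FD and shows how weakening the FD, rather than editing the data, can also restore consistency. It is not the authors' algorithm: the function name `fd_violations`, the sample rows, and the choice of adding an attribute to the left-hand side as the way to modify an FD are all assumptions made for the example.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs values whose rows disagree on rhs (hypothetical helper).

    rows: list of dicts, lhs/rhs: lists of attribute names for the FD lhs -> rhs.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(tuple(row[a] for a in rhs))
    # An FD is violated wherever one lhs value maps to several rhs values.
    return {key: vals for key, vals in groups.items() if len(vals) > 1}


# Toy data (invented for illustration only).
rows = [
    {"zip": "10001", "city": "New York", "street": "5th Ave"},
    {"zip": "10001", "city": "NYC", "street": "Broadway"},
    {"zip": "60601", "city": "Chicago", "street": "Wacker Dr"},
]

# Trusting the FD zip -> city: the data is at fault and must be repaired.
violations = fd_violations(rows, ["zip"], ["city"])

# Trusting the data: a weaker FD (zip, street) -> city already holds.
relaxed = fd_violations(rows, ["zip", "street"], ["city"])
```

Here `violations` flags the two rows that share a zip code, while `relaxed` is empty. Which of the two repairs is preferable depends on how much the data is trusted relative to the FD, which is the question the abstract raises.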
Submitted 24 July, 2012; v1 submitted 22 July, 2012;
originally announced July 2012.
Factorization-based Lossless Compression of Inverted Indices
Authors:
George Beskales,
Marcus Fontoura,
Maxim Gurevich,
Sergei Vassilvitskii,
Vanja Josifovski
Abstract:
Many large-scale Web applications that require ranked top-k retrieval, such as Web search and online advertising, are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document association. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space that minimizes the resulting index size as a matrix factorization problem, and prove that finding the optimal factorization is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is an increase in the number of terms in the index, which may negatively affect query evaluation performance. To eliminate this effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach. Our experimental evaluation demonstrates that our approach achieves an index size reduction of 20%, while maintaining the same query response times. Higher compression ratios of up to 35% are achievable, but at the cost of slightly longer query response times. Furthermore, combining our approach with other lossless compression techniques, namely variable-byte encoding, leads to an index size reduction of up to 50%.
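For readers unfamiliar with the variable-byte encoding mentioned at the end of this abstract, a minimal sketch follows. The abstract does not say how the encoding is applied, so the details here are assumptions: document IDs are stored as gaps between consecutive sorted IDs, each gap is split into 7-bit groups with the low-order group first, and a set high bit marks that more bytes follow. All function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode one non-negative integer, 7 payload bits per byte."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def vbyte_decode_all(data):
    """Decode a byte string produced by concatenating vbyte_encode outputs."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Store a sorted posting list as variable-byte encoded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode_all(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 130, 131, 20000]      # invented example posting list
encoded = encode_postings(postings)  # gaps are 3, 127, 1, 19869
```

Because small gaps fit in a single byte, the four document IDs above occupy 6 bytes instead of 16 bytes as fixed 32-bit integers, and `decode_postings(encoded)` recovers the original list exactly, so the scheme is lossless.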
Submitted 9 August, 2011;
originally announced August 2011.