-
Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC
Authors:
Sam M. Thompson,
Dominik D. Freydenberger
Abstract:
Despite considerable research on document spanners, little is known about the expressive power of generalized core spanners. In this paper, we use Ehrenfeucht-Fraïssé games to obtain general inexpressibility lemmas for the logic FC (a finite-model variant of the theory of concatenation). Applying these lemmas gives inexpressibility results for FC that we lift to generalized core spanners. In particular, we give several relations that cannot be selected by generalized core spanners, thus demonstrating the effectiveness of the inexpressibility lemmas. As an immediate consequence, we also gain new insights into the expressive power of core spanners.
Submitted 28 June, 2023;
originally announced June 2023.
-
Splitting Spanner Atoms: A Tool for Acyclic Core Spanners
Authors:
Dominik D. Freydenberger,
Sam M. Thompson
Abstract:
This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast to this, we propose an alternative definition by converting SERCQs into FC-CQs -- conjunctive queries in FC, a logic that is based on word equations. We introduce a way to decompose word equations of unbounded arity into a conjunction of binary word equations. If the result of the decomposition is acyclic, then evaluation and enumeration of results become tractable. The main result of this work is an algorithm that decides in polynomial time whether an FC-CQ can be decomposed into an acyclic FC-CQ. We also give an efficient conversion from synchronized SERCQs to FC-CQs with regular constraints. As a consequence, tractability results for acyclic relational CQs directly translate to a large class of SERCQs.
Submitted 19 January, 2022; v1 submitted 10 April, 2021;
originally announced April 2021.
-
The theory of concatenation over finite models
Authors:
Dominik D. Freydenberger,
Liat Peterfreund
Abstract:
We propose FC, a new logic on words that combines finite model theory with the theory of concatenation - a first-order logic that is based on word equations. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its factors. As a consequence of this, FC has many of the desirable properties of FO on finite models, while being far more expressive than FO[<]. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking, and capturing various complexity classes by adding operators for transitive closures or fixed points.
Not only does FC allow us to obtain new insights and techniques for expressive power and efficient evaluation of document spanners, but it also provides a general framework for logic on words with potential applications in other areas.
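The idea of limiting the universe to a word and all its factors can be pictured with a small brute-force sketch. It is only an illustration under stated assumptions: the function names `factors` and `contains_square` are hypothetical, and the example formula (there exist x and y, with y non-empty, such that x equals y concatenated with y) is chosen here for illustration rather than taken from the paper.

```python
def factors(w: str) -> set[str]:
    """Return every factor (contiguous substring) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def contains_square(w: str) -> bool:
    """Check by exhaustive search whether some factor x of w has the form y.y
    for a non-empty factor y. Both variables range only over the factors of w,
    which is the finite universe described in the abstract."""
    universe = factors(w)
    return any(y != "" and x == y + y for x in universe for y in universe)
```

Because the universe is finite, the quantifiers can be checked by plain enumeration; this says nothing about how efficient evaluation would be done in practice.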
Submitted 13 May, 2021; v1 submitted 12 December, 2019;
originally announced December 2019.
-
Dynamic Complexity of Document Spanners
Authors:
Dominik D. Freydenberger,
Sam M. Thompson
Abstract:
The present paper investigates the dynamic complexity of document spanners, a formal framework for information extraction introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (JACM 2015). We first look at the class of regular spanners and prove that any regular spanner can be maintained in the dynamic complexity class DynPROP. This result follows from work done previously on the dynamic complexity of formal languages by Gelade, Marquardt, and Schwentick (TOCL 2012).
To investigate core spanners, we use SpLog, a concatenation logic that exactly captures core spanners. We show that the dynamic complexity class DynCQ is more expressive than SpLog and therefore can maintain any core spanner. This result is then extended to show that DynFO can maintain any generalized core spanner and that DynFO is at least as powerful as SpLog with negation.
Submitted 9 January, 2020; v1 submitted 24 September, 2019;
originally announced September 2019.
-
Complexity Bounds for Relational Algebra over Document Spanners
Authors:
Liat Peterfreund,
Dominik D. Freydenberger,
Benny Kimelfeld,
Markus Kröll
Abstract:
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA by allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression.
Submitted 6 February, 2019; v1 submitted 14 January, 2019;
originally announced January 2019.
-
Deterministic Regular Expressions With Back-References
Authors:
Dominik D. Freydenberger,
Markus L. Schmid
Abstract:
Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references.
Submitted 5 February, 2018;
originally announced February 2018.
-
Joining Extractions of Regular Expressions
Authors:
Dominik D. Freydenberger,
Benny Kimelfeld,
Liat Peterfreund
Abstract:
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
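To make the notion of a relation of spans concrete, here is a minimal sketch for a single capture variable. It assumes 0-based half-open spans `(i, j)` in the style of Python slices (the spanner literature uses its own span notation), and the helper name `extract_spans` is hypothetical. It uses brute force over all intervals and is not the evaluation technique of the paper.

```python
import re


def extract_spans(doc: str, pattern: str) -> list[tuple[int, int]]:
    """Return every span (i, j) such that doc[i:j] matches `pattern` in full.

    This mimics a regex formula that lets one variable capture any interval
    of the document whose content matches the sub-expression `pattern`.
    """
    compiled = re.compile(pattern)
    return [
        (i, j)
        for i in range(len(doc) + 1)
        for j in range(i, len(doc) + 1)
        if compiled.fullmatch(doc, i, j)
    ]
```

For example, `extract_spans("baab", "a+")` yields the three spans covering `a`, `aa`, and `a`. Unlike `re.finditer`, overlapping spans are all reported, so the relation can grow quickly as more variables are added.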
Submitted 30 March, 2017;
originally announced March 2017.
-
Testing k-binomial equivalence
Authors:
Dominik D. Freydenberger,
Pawel Gawrychowski,
Juhani Karhumäki,
Florin Manea,
Wojciech Rytter
Abstract:
Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$ and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it appears as a scattered factor of $w_2$. We give two different polynomial-time algorithms testing the $k$-binomial equivalence of two words. The first one is deterministic (but the degree of the corresponding polynomial is too high) and the second one is randomised (it is more direct and more efficient). These are the first known algorithms for the problem which run in polynomial time.
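The definition of $k$-binomial equivalence can be checked directly on small inputs. The sketch below is a naive baseline that enumerates every candidate word $x$, so its running time is exponential in $k$; it is not either of the polynomial-time algorithms from the paper, and the function names are illustrative assumptions.

```python
from itertools import product


def count_scattered(w: str, x: str) -> int:
    """Count the occurrences of x as a scattered factor (subsequence) of w."""
    # dp[j] = number of ways to embed x[:j] into the part of w read so far.
    dp = [1] + [0] * len(x)
    for c in w:
        # Go right to left so each letter of w is used at most once per embedding.
        for j in range(len(x), 0, -1):
            if x[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[len(x)]


def k_binomial_equivalent(w1: str, w2: str, k: int) -> bool:
    """Compare scattered-factor counts for every non-empty x of length at most k."""
    alphabet = sorted(set(w1) | set(w2))
    for n in range(1, k + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if count_scattered(w1, x) != count_scattered(w2, x):
                return False
    return True
```

For instance, `abba` and `baab` agree on all scattered factors of length at most 2, but `aba` occurs twice in `abba` and never in `baab`, so the two words are 2-binomial equivalent and not 3-binomial equivalent.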
Submitted 12 October, 2015; v1 submitted 2 September, 2015;
originally announced September 2015.