Search | arXiv e-print repository

Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC

Authors: Sam M. Thompson, Dominik D. Freydenberger

Abstract: Despite considerable research on document spanners, little is known about the expressive power of generalized core spanners. In this paper, we use Ehrenfeucht-Fraïssé games to obtain general inexpressibility lemmas for the logic FC (a finite-model variant of the theory of concatenation). Applying these lemmas give inexpressibility results for FC that we lift to generalized core spanners. In partic… ▽ More Despite considerable research on document spanners, little is known about the expressive power of generalized core spanners. In this paper, we use Ehrenfeucht-Fraïssé games to obtain general inexpressibility lemmas for the logic FC (a finite-model variant of the theory of concatenation). Applying these lemmas give inexpressibility results for FC that we lift to generalized core spanners. In particular, we give several relations that cannot be selected by generalized core spanners, thus demonstrating the effectiveness of the inexpressibility lemmas. As an immediate consequence, we also gain new insights into the expressive power of core spanners. △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2104.04758 [pdf, other]

Splitting Spanner Atoms: A Tool for Acyclic Core Spanners

Authors: Dominik D. Freydenberger, Sam M. Thompson

Abstract: This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast to this, we propose an alternative definition by converting SERCQs into FC-CQs… ▽ More This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast to this, we propose an alternative definition by converting SERCQs into FC-CQs -- conjunctive queries in FC, a logic that is based on word equations. We introduce a way to decompose word equations of unbounded arity into a conjunction of binary word equations. If the result of the decomposition is acyclic, then evaluation and enumeration of results become tractable. The main result of this work is an algorithm that decides in polynomial time whether an FC-CQ can be decomposed into an acyclic FC-CQ. We also give an efficient conversion from synchronized SERCQs to FC-CQs with regular constraints. As a consequence, tractability results for acyclic relational CQs directly translate to a large class of SERCQs. △ Less

Submitted 19 January, 2022; v1 submitted 10 April, 2021; originally announced April 2021.

arXiv:1912.06110 [pdf, other]

The theory of concatenation over finite models

Authors: Dominik D. Freydenberger, Liat Peterfreund

Abstract: We propose FC, a new logic on words that combines finite model theory with the theory of concatenation - a first-order logic that is based on word equations. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its factors. As a consequence of this, FC has many of th… ▽ More We propose FC, a new logic on words that combines finite model theory with the theory of concatenation - a first-order logic that is based on word equations. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its factors. As a consequence of this, FC has many of the desirable properties of FO on finite models, while being far more expressive than FO[<]. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking, and capturing various complexity classes by adding operators for transitive closures or fixed points. Not only does FC allow us to obtain new insights and techniques for expressive power and efficient evaluation of document spanners, but it also provides a general framework for logic on words that also has potential applications in other areas. △ Less

Submitted 13 May, 2021; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: Update to make this version consistent with conference version (ICALP 2021), which renamed Datasplog to FC-Datalog

arXiv:1909.10869 [pdf, other]

Dynamic Complexity of Document Spanners

Authors: Dominik D. Freydenberger, Sam M. Thompson

Abstract: The present paper investigates the dynamic complexity of document spanners, a formal framework for information extraction introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (JACM 2015). We first look at the class of regular spanners and prove that any regular spanner can be maintained in the dynamic complexity class DynPROP. This result follows from work done previously on the dynamic complexi… ▽ More The present paper investigates the dynamic complexity of document spanners, a formal framework for information extraction introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (JACM 2015). We first look at the class of regular spanners and prove that any regular spanner can be maintained in the dynamic complexity class DynPROP. This result follows from work done previously on the dynamic complexity of formal languages by Gelade, Marquardt, and Schwentick (TOCL 2012). To investigate core spanners we use SpLog, a concatenation logic that exactly captures core spanners. We show that the dynamic complexity class DynCQ, is more expressive than SpLog and therefore can maintain any core spanner. This result is then extended to show that DynFO can maintain any generalized core spanner and that DynFO is at least as powerful as SpLog with negation. △ Less

Submitted 9 January, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

arXiv:1901.04182 [pdf, ps, other]

Complexity Bounds for Relational Algebra over Document Spanners

Authors: Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, Markus Kröll

Abstract: We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections an… ▽ More We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression. △ Less

Submitted 6 February, 2019; v1 submitted 14 January, 2019; originally announced January 2019.

arXiv:1802.01508 [pdf, ps, other]

Deterministic Regular Expressions With Back-References

Authors: Dominik D. Freydenberger, Markus L. Schmid

Abstract: Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a su… ▽ More Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references. △ Less

Submitted 5 February, 2018; originally announced February 2018.

MSC Class: 68Q45; 68Q19; 68Q17

arXiv:1703.10350 [pdf, ps, other]

Joining Extractions of Regular Expressions

Authors: Dominik D. Freydenberger, Benny Kimelfeld, Liat Peterfreund

Abstract: Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of C… ▽ More Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ. △ Less

Submitted 30 March, 2017; originally announced March 2017.

arXiv:1509.00622 [pdf, ps, other]

Testing k-binomial equivalence

Authors: Dominik D. Freydenberger, Pawel Gawrychowski, Juhani Karhumäki, Florin Manea, Wojciech Rytter

Abstract: Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$ and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it appears as a scattered factor of $w_2$. We give two different polynomial-time algorithms testing the $k$-binomial equivalence of two words. The first one is deterministic (but the d… ▽ More Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$ and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it appears as a scattered factor of $w_2$. We give two different polynomial-time algorithms testing the $k$-binomial equivalence of two words. The first one is deterministic (but the degree of the corresponding polynomial is too high) and the second one is randomised (it is more direct and more efficient). These are the first known algorithms for the problem which run in polynomial time. △ Less

Submitted 12 October, 2015; v1 submitted 2 September, 2015; originally announced September 2015.

Journal ref: "Multidisciplinary Creativity: homage to Gheorghe Paun on his 65th birthday", Pg. 239--248, Ed. Spandugino, Bucharest, Romania, ISBN: 978-606-8401-63-8, 2015

Showing 1–8 of 8 results for author: Freydenberger, D D