-
Using Color Refinement to Boost Enumeration and Counting for Acyclic CQs of Binary Schemas
Authors:
Cristian Riveros,
Benjamin Scheidt,
Nicole Schweikardt
Abstract:
We present an index structure, called the color-index, to boost the evaluation of acyclic conjunctive queries (ACQs) over binary schemas. The color-index is based on the color refinement algorithm, a widely used subroutine for graph isomorphism testing algorithms. Given a database $D$, we use a suitable version of the color refinement algorithm to produce a stable coloring of $D$, an assignment from the active domain of $D$ to a set of colors $C_D$. The main ingredient of the color-index is a particular database $D_c$ whose active domain is $C_D$ and whose size is at most $|D|$. Using the color-index, we can evaluate any free-connex ACQ $Q$ over $D$ with preprocessing time $O(|Q| \cdot |D_c|)$ and constant delay enumeration. Furthermore, we can also count the number of results of $Q$ over $D$ in time $O(|Q| \cdot |D_c|)$. Given that $|D_c|$ could be much smaller than $|D|$ (even constant-size for some families of databases), the color-index is the first index structure for evaluating free-connex ACQs that allows efficient enumeration and counting with performance that may be strictly smaller than the database size.
Submitted 20 May, 2024;
originally announced May 2024.
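As a rough illustration of the color refinement step mentioned in the abstract above, the following is a minimal sketch on a plain undirected graph. The paper works with databases over binary schemas and builds an index on top of the coloring; none of that is reproduced here. The function name `color_refinement` and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import Counter


def color_refinement(adj):
    """Return a stable coloring {node: color} of an undirected graph.

    `adj` maps each node to the set of its neighbors (hypothetical input
    format chosen for this sketch, not taken from the paper).
    """
    colors = {v: 0 for v in adj}  # start with a single color
    while True:
        # Signature of a node: its own color plus the multiset of neighbor colors.
        sigs = {
            v: (colors[v], tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
            for v in adj
        }
        # Rename signatures to small integers in a canonical order.
        palette = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Refinement only splits classes, so an unchanged class count means stability.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors


# A path 0-1-2-3 needs two colors; a 6-cycle needs only one.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
path_colors = color_refinement(path)
cycle_colors = color_refinement(cycle)
```

The cycle example hints at why the set of colors can be much smaller than the database: every node of a cycle receives the same color regardless of the cycle's length.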
-
A framework for extraction and transformation of documents
Authors:
Cristian Riveros,
Markus L. Schmid,
Nicole Schweikardt
Abstract:
We present a theoretical framework for the extraction and transformation of text documents. We propose to use a two-phase process where the first phase extracts span-tuples from a document, and the second phase maps the content of the span-tuples into new documents. We base the extraction phase on the framework of document spanners and the transformation phase on the theory of polyregular functions, the class of regular string-to-string functions with polynomial growth.
For supporting practical extract-transform scenarios, we propose an extension of document spanners described by regex formulas from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans. We prove that this extension, called regex multispanners, has the same desirable properties as standard spanners described by regex formulas. In our framework, an Extract-Transform (ET) program is given by a regex multispanner followed by a polyregular function.
In this paper, we study the expressibility and evaluation problem of ET programs when the transformation function is linear, called linear ET programs. We show that linear ET programs are equally expressive as non-deterministic streaming string transducers under bag semantics. Moreover, we show that linear ET programs are closed under composition. Finally, we present an enumeration algorithm for evaluating every linear ET program over a document with linear time preprocessing and constant delay.
Submitted 20 May, 2024;
originally announced May 2024.
-
Enumeration and updates for conjunctive linear algebra queries through expressibility
Authors:
Thomas Muñoz,
Cristian Riveros,
Stijn Vansummeren
Abstract:
Due to the importance of linear algebra and matrix operations in data analytics, there is significant interest in using relational query optimization and processing techniques for evaluating (sparse) linear algebra programs. In particular, in recent years close connections have been established between linear algebra programs and relational algebra that allow transferring optimization techniques of the latter to the former. In this paper, we ask ourselves which linear algebra programs in MATLANG correspond to the free-connex and q-hierarchical fragments of conjunctive first-order logic. Both fragments have desirable query processing properties: free-connex conjunctive queries support constant-delay enumeration after a linear-time preprocessing phase, and q-hierarchical conjunctive queries further allow constant-time updates. By characterizing the corresponding fragments of MATLANG, we hence identify the fragments of linear algebra programs that one can evaluate with constant-delay enumeration after linear-time preprocessing and with constant-time updates. To derive our results, we improve and generalize previous correspondences between MATLANG and relational algebra evaluated over semiring-annotated relations. In addition, we identify properties on semirings that allow us to generalize the complexity bounds for free-connex and q-hierarchical conjunctive queries from Boolean annotations to general semirings.
Submitted 9 November, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Constant-delay enumeration for SLP-compressed documents
Authors:
Martín Muñoz,
Cristian Riveros
Abstract:
We study the problem of enumerating results from a query over a compressed document. The compression model we use is that of straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our queries, we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase which takes time linear in the size of the SLP and cubic in the size of the automaton. This is an improvement over Schmid and Schweikardt's result, which, with the same preprocessing time, enumerates with a delay that is logarithmic in the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata which utilizes succinctly encoded annotations to save an exponential factor from previous results that dealt with constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion Schmid and Schweikardt did to allow complex document editing while maintaining the constant delay guarantee.
Submitted 30 October, 2023; v1 submitted 25 September, 2022;
originally announced September 2022.
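To make the notion of a straight-line program from the abstract above concrete, here is a small sketch of an SLP as a grammar that derives exactly one string. It only shows the compression model, not the paper's enumeration algorithm or its data structure. The rule encoding (a dictionary mapping each nonterminal to either a single character or a pair of nonterminals) and the names `slp_length` and `slp_expand` are assumptions made for this example.

```python
def slp_length(rules, start):
    """Length of the string derived from `start`, computed without expanding it."""
    memo = {}

    def length(sym):
        if sym not in memo:
            rhs = rules[sym]
            # A rule is either a terminal character or a pair of nonterminals.
            memo[sym] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return memo[sym]

    return length(start)


def slp_expand(rules, start):
    """Fully decompress the SLP (can take time exponential in the grammar size)."""
    rhs = rules[start]
    if isinstance(rhs, str):
        return rhs
    return slp_expand(rules, rhs[0]) + slp_expand(rules, rhs[1])


# Five rules describe a string of length 8; each extra doubling rule doubles the length.
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # ab
    "X2": ("X1", "X1"),  # abab
    "X3": ("X2", "X2"),  # abababab
}
document_length = slp_length(rules, "X3")
document = slp_expand(rules, "X3")
```

Because each doubling rule doubles the derived string, the document can be exponentially longer than the grammar, which is why preprocessing time measured in the size of the SLP rather than the document matters.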
-
Probabilistic Automata of Bounded Ambiguity
Authors:
Nathanaël Fijalkow,
Cristian Riveros,
James Worrell
Abstract:
Probabilistic automata are an extension of nondeterministic finite automata in which transitions are annotated with probabilities. Despite its simplicity, this model is very expressive and many of the associated algorithmic questions are undecidable. In this work we focus on the emptiness problem (and its variant the value problem), which asks whether a given probabilistic automaton accepts some word with probability greater than a given threshold. We consider a natural and well-studied structural restriction on automata, namely the degree of ambiguity, which is defined as the maximum number of accepting runs over all words. The known undecidability proofs exploit infinite ambiguity and so we focus on the case of finitely ambiguous probabilistic automata. Our main contributions are to construct efficient algorithms for analysing finitely ambiguous probabilistic automata through a reduction to a multi-objective optimisation problem called the stochastic path problem. We obtain a polynomial time algorithm for approximating the value of probabilistic automata of fixed ambiguity and a quasi-polynomial time algorithm for the emptiness problem for 2-ambiguous probabilistic automata. We complement these positive results by an inapproximability result stating that the value of finitely ambiguous probabilistic automata cannot be approximated unless PTIME = NP.
Submitted 19 May, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Efficient Enumeration Algorithms for Annotated Grammars
Authors:
Antoine Amarilli,
Louis Jachiet,
Martín Muñoz,
Cristian Riveros
Abstract:
We introduce annotated grammars, an extension of context-free grammars which allows annotations on terminals. Our model extends the standard notion of regular spanners, and is more expressive than the extraction grammars recently introduced by Peterfreund. We study the enumeration problem for annotated grammars: fixing a grammar, and given a string as input, enumerate all annotations of the string that form a word derivable from the grammar. Our first result is an algorithm for unambiguous annotated grammars, which preprocesses the input string in cubic time and enumerates all annotations with output-linear delay. This improves over Peterfreund's result, which needs quintic time preprocessing to achieve this delay bound. We then study how we can reduce the preprocessing time while keeping the same delay bound, by making additional assumptions on the grammar. Specifically, we present a class of grammars which only have one derivation shape for all outputs, for which we can enumerate with quadratic time preprocessing. We also give classes that generalize regular spanners for which linear time preprocessing suffices.
Submitted 17 May, 2022; v1 submitted 3 January, 2022;
originally announced January 2022.
-
How do centrality measures choose the root of trees?
Authors:
Cristian Riveros,
Jorge Salas,
Oskar Skibski
Abstract:
Centrality measures are widely used to assign importance to graph-structured data. Recently, understanding the principles of such measures has attracted a lot of attention. Given that measures are diverse, this research has usually focused on classes of centrality measures. In this work, we provide a different approach by focusing on classes of graphs instead of classes of measures to understand the underlying principles among various measures. More precisely, we study the class of trees. We observe that even in the case of trees, there is no consensus on which node should be selected as the most central. To analyze the behavior of centrality measures on trees, we introduce a property of tree rooting that states a measure selects one or two adjacent nodes as the most important, and the importance decreases from them in all directions. This property is satisfied by closeness centrality but violated by PageRank. We show that, for several centrality measures that root trees, the comparison of adjacent nodes can be inferred by potential functions that assess the quality of trees. We use these functions to give fundamental insights on rooting and derive a characterization explaining why some measures root trees. Moreover, we provide an almost linear-time algorithm to compute the root of a graph by using potential functions. Finally, using a family of potential functions, we show that many ways of tree rooting exist with desirable properties.
Submitted 29 November, 2022; v1 submitted 27 December, 2021;
originally announced December 2021.
-
CORE: a Complex Event Recognition Engine
Authors:
Marco Bucchi,
Alejandro Grez,
Andrés Quintana,
Cristian Riveros,
Stijn Vansummeren
Abstract:
Complex Event Recognition (CER) systems are a prominent technology for finding user-defined query patterns over large data streams in real time. CER query evaluation is known to be computationally challenging, since it requires maintaining a set of partial matches, and this set quickly grows super-linearly in the number of processed events. We present CORE, a novel COmplex event Recognition Engine that focuses on the efficient evaluation of a large class of complex event queries, including time windows as well as the partition-by event correlation operator. This engine uses a novel automaton-based evaluation algorithm that circumvents the super-linear partial match problem: under data complexity, it takes constant time per input event to maintain a data structure that compactly represents the set of partial matches and, once a match is found, the query results may be enumerated from the data structure with output-linear delay. We experimentally compare CORE against state-of-the-art CER systems on real-world data. We show that (1) CORE's performance is stable with respect to both query and time window size, and (2) CORE outperforms the other systems by up to five orders of magnitude on different workloads.
Submitted 26 May, 2022; v1 submitted 8 November, 2021;
originally announced November 2021.
-
MillenniumDB: A Persistent, Open-Source, Graph Database
Authors:
Domagoj Vrgoc,
Carlos Rojas,
Renzo Angles,
Marcelo Arenas,
Diego Arroyuelo,
Carlos Buil Aranda,
Aidan Hogan,
Gonzalo Navarro,
Cristian Riveros,
Juan Romero
Abstract:
In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.
Submitted 2 November, 2021;
originally announced November 2021.
-
Expressive power of linear algebra query languages
Authors:
Floris Geerts,
Thomas Muñoz,
Cristian Riveros,
Domagoj Vrgoč
Abstract:
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
Submitted 26 October, 2020;
originally announced October 2020.
-
Ranked enumeration of MSO logic on words
Authors:
Pierre Bourhis,
Alejandro Grez,
Louis Jachiet,
Cristian Riveros
Abstract:
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework which subsumes, for instance, document spanners and regular complex event processing queries and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit the ranked enumeration of MSO on words.
Submitted 15 October, 2020;
originally announced October 2020.
-
Streaming enumeration on nested documents
Authors:
Martín Muñoz,
Cristian Riveros
Abstract:
Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In the last decade, the task of extracting data from nested documents over streams has become especially relevant. We focus on the streaming evaluation of queries with outputs of varied sizes over nested documents. We model queries of this kind as Visibly Pushdown Transducers (VPT), a computational model that extends visibly pushdown automata with outputs and has the same expressive power as MSO over nested documents. Since processing a document through a VPT can generate a massive number of results, we are interested in reading the input in a streaming fashion and enumerating the outputs one after another as efficiently as possible, namely, with constant-delay. This paper presents an algorithm that enumerates these elements with constant-delay after processing the document stream in a single pass. Furthermore, we show that this algorithm is worst-case optimal in terms of update-time per symbol and memory usage.
Submitted 7 January, 2022; v1 submitted 12 October, 2020;
originally announced October 2020.
-
When is Approximate Counting for Conjunctive Queries Tractable?
Authors:
Marcelo Arenas,
Luis Alberto Croquevielle,
Rajesh Jayaram,
Cristian Riveros
Abstract:
Conjunctive queries are one of the most common classes of queries used in database systems, and the best studied in the literature. A seminal result of Grohe, Schwentick, and Segoufin (STOC 2001) demonstrates that for every class $G$ of graphs, the evaluation of all conjunctive queries whose underlying graph is in $G$ is tractable if, and only if, $G$ has bounded treewidth. In this work, we extend this characterization to the counting problem for conjunctive queries. Specifically, for every class $C$ of conjunctive queries with bounded treewidth, we introduce the first fully polynomial-time randomized approximation scheme (FPRAS) for counting answers to a query in $C$, and the first polynomial-time algorithm for sampling answers uniformly from a query in $C$. As a corollary, it follows that for every class $G$ of graphs, the counting problem for conjunctive queries whose underlying graph is in $G$ admits an FPRAS if, and only if, $G$ has bounded treewidth (unless $\text{BPP} \neq \text{P}$). In fact, our FPRAS is more general, and also applies to conjunctive queries with bounded hypertree width, as well as unions of such queries.
The key ingredient in our proof is the resolution of a fundamental counting problem from automata theory. Specifically, we demonstrate the first FPRAS and polynomial time sampler for the set of trees of size $n$ accepted by a tree automaton, which improves the prior quasi-polynomial time randomized approximation scheme (QPRAS) and sampling algorithm of Gore, Jerrum, Kannan, Sweedyk, and Mahaney '97. We demonstrate how this algorithm can be used to obtain an FPRAS for many hitherto open problems, such as counting solutions to constraint satisfaction problems (CSP) with bounded hypertree-width, counting the number of error threads in programs with nested call subroutines, and counting valid assignments to structured DNNF circuits.
Submitted 20 November, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.
-
The monitoring problem for timed automata
Authors:
Alejandro Grez,
Filip Mazowiecki,
Michał Pilipczuk,
Gabriele Puppis,
Cristian Riveros
Abstract:
We study a variant of the classical membership problem in automata theory, which consists of deciding whether a given input word is accepted by a given automaton. We do so under a different perspective, that is, we consider a dynamic version of the problem, called monitoring problem, where the automaton is fixed and the input is revealed as in a stream, one symbol at a time following the natural order on positions. The goal here is to design a dynamic data structure that can be queried about whether the word consisting of symbols revealed so far is accepted by the automaton, and that can be efficiently updated when the next symbol is revealed. We provide complexity bounds for this monitoring problem, by considering timed automata that process symbols interleaved with timestamps. The main contribution is that monitoring of a one-clock timed automaton, with all its components but the clock constants fixed, can be done in amortised constant time per input symbol.
Submitted 17 February, 2020;
originally announced February 2020.
-
Pumping lemmas for weighted automata
Authors:
Agnishom Chattopadhyay,
Filip Mazowiecki,
Anca Muscholl,
Cristian Riveros
Abstract:
We present pumping lemmas for five classes of functions definable by fragments of weighted automata over the min-plus semiring, the max-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus and max-plus semirings.
Submitted 20 July, 2021; v1 submitted 17 January, 2020;
originally announced January 2020.
-
$\text{#NFA}$ admits an FPRAS: Efficient Enumeration, Counting, and Uniform Generation for Logspace Classes
Authors:
Marcelo Arenas,
Luis Alberto Croquevielle,
Rajesh Jayaram,
Cristian Riveros
Abstract:
In this work, we study two simple yet general complexity classes, based on logspace Turing machines, which provide a unifying framework for efficient query evaluation in areas like information extraction and graph databases, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect.
Both complexity classes are defined in terms of non-deterministic logspace transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm for uniform generation. Interestingly, the key idea to prove these results is to show that the fundamental problem $\text{#NFA}$ admits an FPRAS, where $\text{#NFA}$ is the problem of counting the number of strings of length $n$ (given in unary) accepted by a non-deterministic finite automaton (NFA). While this problem is known to be $\text{#P}$-complete and, more precisely, $\text{SpanL}$-complete, it was open whether this problem admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in $\text{SpanL}$ admits an FPRAS.
Submitted 23 June, 2021; v1 submitted 21 June, 2019;
originally announced June 2019.
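To make the #NFA problem from the abstract above concrete, the sketch below counts exactly the strings of length $n$ accepted by an NFA, using a dynamic program over reachable state subsets. This is explicitly not the FPRAS from the paper: it is an exact baseline whose running time can be exponential in the number of states, which is the cost an approximation scheme is meant to avoid. The function name `count_accepted` and the transition encoding are assumptions made for this example.

```python
from collections import defaultdict


def count_accepted(alphabet, delta, initial, finals, n):
    """Exactly count the strings of length n accepted by an NFA.

    `delta` maps (state, symbol) to a set of successor states; `initial`
    and `finals` are sets of states (hypothetical encoding for this sketch).
    """
    # counts[S] = number of strings read so far whose set of reachable states is S.
    counts = {frozenset(initial): 1}
    for _ in range(n):
        nxt = defaultdict(int)
        for subset, c in counts.items():
            for a in alphabet:
                target = frozenset(q for p in subset for q in delta.get((p, a), ()))
                if target:  # strings with no surviving run can never be accepted
                    nxt[target] += c
        counts = nxt
    # Each string reaches exactly one subset, so no string is counted twice.
    return sum(c for subset, c in counts.items() if subset & finals)


# NFA over {a, b} accepting the strings that contain at least one 'a'.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {1},
    (1, "b"): {1},
}
accepted_len3 = count_accepted("ab", delta, {0}, {1}, 3)
```

Tracking subsets rather than individual runs is what keeps the count exact: summing over accepting runs would count a string once per run, which overcounts for ambiguous automata.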
-
Descriptive Complexity for Counting Complexity Classes
Authors:
Marcelo Arenas,
Martin Muñoz,
Cristian Riveros
Abstract:
Descriptive Complexity has been very successful in characterizing complexity classes of decision problems in terms of the properties definable in some logics. However, descriptive complexity for counting complexity classes, such as FP and #P, has not been systematically studied, and it is not as developed as its decision counterpart. In this paper, we propose a framework based on Weighted Logics to address this issue. Specifically, by focusing on the natural numbers we obtain a logic called Quantitative Second Order Logics (QSO), and show how some of its fragments can be used to capture fundamental counting complexity classes such as FP, #P and FPSPACE, among others. We also use QSO to define a hierarchy inside #P, identifying counting complexity classes with good closure and approximation properties, and which admit natural complete problems. Finally, we add recursion to QSO, and show how this extension naturally captures lower counting complexity classes such as #L.
Submitted 7 February, 2020; v1 submitted 7 May, 2018;
originally announced May 2018.
-
Constant delay algorithms for regular document spanners
Authors:
Fernando Florenzano,
Cristian Riveros,
Martin Ugarte,
Stijn Vansummeren,
Domagoj Vrgoc
Abstract:
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
Submitted 14 March, 2018;
originally announced March 2018.
-
A Second-Order Approach to Complex Event Recognition
Authors:
Alejandro Grez,
Cristian Riveros,
Martin Ugarte,
Stijn Vansummeren
Abstract:
Complex Event Recognition (CER for short) refers to the activity of detecting patterns in streams of continuously arriving data. This field has been traditionally approached from a practical point of view, resulting in heterogeneous implementations with fundamentally different capabilities. The main reason behind this is that defining formal semantics for a CER language is not trivial: they usually combine first-order variables for joining and filtering events with regular operators like sequencing and Kleene closure. Moreover, their semantics usually focus only on the detection of complex events, leaving the concept of output mostly unattended.
In this paper, we propose to unify the semantics and output of complex event recognition languages by using second order objects. Specifically, we introduce a CER language called Second Order Complex Event Logic (SO-CEL for short), that uses second order variables for managing and outputting sequences of events. This makes the definition of the main CER operators simple, allowing us to develop the first steps in understanding its expressive power. We start by comparing SO-CEL with a version that uses first-order variables called FO-CEL, showing that they are equivalent in expressive power when restricted to unary predicates but, surprisingly, incomparable in general. Nevertheless, we show that if we restrict to sets of binary predicates, then SO-CEL is strictly more expressive than FO-CEL. Then, we introduce a natural computational model called Unary Complex Event Automata (UCEA) that provides a better understanding of SO-CEL. We show that, under unary predicates, SO-CEL captures the subclass of UCEA that satisfy the so-called *-property. Finally, we identify the operations that SO-CEL is lacking to capture UCEA and introduce a natural extension of the language that captures the complete class of UCEA under unary predicates.
Submitted 4 December, 2017;
originally announced December 2017.
-
Foundations of Complex Event Processing
Authors:
Marco Bucchi,
Alejandro Grez,
Cristian Riveros,
Martín Ugarte
Abstract:
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees.
In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we study how to efficiently evaluate CEL for processing complex events in the case of unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. By gathering these results together, we propose a framework for efficiently evaluating CEL with unary filters. Finally, we show experimentally that this framework consistently outperforms the competition, and even over trivial queries can be orders of magnitude more efficient.
Submitted 30 August, 2018; v1 submitted 15 September, 2017;
originally announced September 2017.
-
Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity
Authors:
Francisco Maturana,
Cristian Riveros,
Domagoj Vrgoč
Abstract:
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information, extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment.
Submitted 29 December, 2017; v1 submitted 4 July, 2017;
originally announced July 2017.
-
Copyless Cost-Register Automata: Structure, Expressiveness, and Closure Properties
Authors:
Filip Mazowiecki,
Cristian Riveros
Abstract:
Cost register automata (CRA) and their subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over strings. We study some structural properties, expressiveness, and closure properties of copyless CRA. We show that copyless CRA are strictly less expressive than weighted automata and are not closed under the reverse operation. To find a better class, we impose restrictions on copyless CRA, which ends successfully with a new robust computational model that is closed under reverse and other extensions.
Submitted 27 April, 2018; v1 submitted 7 April, 2015;
originally announced April 2015.
-
Composition and Inversion of Schema Mappings
Authors:
Marcelo Arenas,
Jorge Perez,
Juan Reutter,
Cristian Riveros
Abstract:
In recent years, a lot of attention has been paid to the development of solid foundations for the composition and inversion of schema mappings. In this paper, we review the proposals for the semantics of these crucial operators. For each of these proposals, we concentrate on the following three problems: the definition of the semantics of the operator, the language needed to express the operator, and the algorithmic issues associated with the problem of computing the operator. It should be pointed out that we primarily consider the formalization of schema mappings introduced in the work on data exchange. In particular, when studying the problem of computing the composition and inverse of a schema mapping, we will be mostly interested in computing these operators for mappings specified by source-to-target tuple-generating dependencies.
Submitted 4 March, 2010; v1 submitted 18 October, 2009;
originally announced October 2009.