-
Tractable Conjunctive Queries over Static and Dynamic Relations
Authors:
Ahmet Kara,
Zheng Luo,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
We investigate the evaluation of conjunctive queries over static and dynamic relations. While static relations are given as input and do not change, dynamic relations are subject to inserts and deletes.
We characterise syntactically three classes of queries that admit constant update time and constant enumeration delay. We call such queries tractable. Depending on the class, the preprocessing time is linear, polynomial, or exponential (under data complexity, so the query size is constant).
To decide whether a query is tractable, it does not suffice to analyse separately the sub-query over the static relations and the sub-query over the dynamic relations. Instead, we need to take the interaction between the static and the dynamic relations into account. Even when the sub-query over the dynamic relations is not tractable, the overall query can become tractable if the dynamic relations are sufficiently constrained by the static ones.
Submitted 24 April, 2024;
originally announced April 2024.
-
Insert-Only versus Insert-Delete in Dynamic Query Evaluation
Authors:
Mahmoud Abo Khamis,
Ahmet Kara,
Dan Olteanu,
Dan Suciu
Abstract:
We study the dynamic query evaluation problem: Given a join query Q and a sequence of updates, we would like to construct a data structure that supports constant-delay enumeration of the query output after each update.
We show that a sequence of N insert-only updates (to an initially empty database) can be executed in total time O(N^{w(Q)}), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the average time per single-tuple insert is constant for acyclic joins.
In contrast, we show that a sequence of N insert-and-delete updates to Q can be executed in total time O(N^{w(Q')}), where Q' is obtained from Q by extending every relational atom with extra variables that represent the "lifespans" of tuples in Q. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time of Q. Our approach recovers the optimal single-tuple update time for known queries such as the hierarchical and Loomis-Whitney join queries.
Submitted 8 June, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments
Authors:
Ali Devran Kara,
Serdar Yuksel
Abstract:
As a primary contribution, we present a convergence theorem for stochastic iterations, and in particular, Q-learning iterates, under a general, possibly non-Markovian, stochastic environment. Our conditions for convergence involve an ergodicity and a positivity criterion. We provide a precise characterization of the limit of the iterates and conditions on the environment and initializations for convergence. As our second contribution, we discuss the implications and applications of this theorem to a variety of stochastic control problems with non-Markovian environments involving (i) quantized approximations of fully observed Markov Decision Processes (MDPs) with continuous spaces (where quantization breaks down the Markovian structure), (ii) quantized approximations of belief-MDP reduced partially observable MDPs (POMDPs) with weak Feller continuity and a mild version of filter stability (which requires the knowledge of the model by the controller), (iii) finite window approximations of POMDPs under a uniform controlled filter stability (which does not require the knowledge of the model), and (iv) multi-agent models where convergence of learning dynamics to a new class of equilibria, subjective Q-learning equilibria, will be studied. In addition to the convergence theorem, some implications of the theorem above are new to the literature and others are interpreted as applications of the convergence theorem. Some open problems are noted.
Submitted 4 March, 2024; v1 submitted 31 October, 2023;
originally announced November 2023.
-
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Authors:
Atakan Kara,
Farrin Marouf Sofian,
Andrew Bond,
Gözde Gül Şahin
Abstract:
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
Submitted 20 September, 2023;
originally announced September 2023.
-
Banzhaf Values for Facts in Query Answering
Authors:
Omer Abramovich,
Daniel Deutch,
Nave Frost,
Ahmet Kara,
Dan Olteanu
Abstract:
Quantifying the contribution of database facts to query answers has been studied as a means of explanation. The Banzhaf value, originally developed in Game Theory, is a natural measure of fact contribution, yet its efficient computation for select-project-join-union queries is challenging. In this paper, we introduce three algorithms to compute the Banzhaf value of database facts: an exact algorithm, an anytime deterministic approximation algorithm with relative error guarantees, and an algorithm for ranking and top-$k$. They have three key building blocks: compilation of query lineage into an equivalent function that allows efficient Banzhaf value computation; dynamic programming computation of the Banzhaf values of variables in a Boolean function using the Banzhaf values for constituent functions; and a mechanism to compute efficiently lower and upper bounds on Banzhaf values for any positive DNF function.
We complement the algorithms with a dichotomy for the Banzhaf-based ranking problem: given two facts, deciding whether the Banzhaf value of one is greater than that of the other is tractable for hierarchical queries and intractable for non-hierarchical queries.
We show experimentally that our algorithms significantly outperform exact and approximate algorithms from prior work, most times up to two orders of magnitude. Our algorithms can also cover challenging problem instances that are beyond reach for prior work.
Submitted 10 August, 2023;
originally announced August 2023.
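To make the notion of a Banzhaf value in the entry above concrete, here is a minimal brute-force sketch in Python. It is not one of the paper's three algorithms; it simply enumerates all subsets, so it is exponential in the number of facts. It assumes the unnormalized definition (the number of subsets whose truth value flips from false to true when the fact is added), and the lineage, the fact names `r1`, `s1`, `s2`, and the function names are hypothetical.

```python
from itertools import combinations

def eval_dnf(clauses, true_vars):
    """A positive DNF is true if some clause has all its variables set to true."""
    return any(clause <= true_vars for clause in clauses)

def banzhaf(clauses, variables, x):
    """Unnormalized Banzhaf value of x: sum over S of f(S + {x}) - f(S).

    Brute force over all subsets S of the other variables (assumed definition).
    """
    others = [v for v in variables if v != x]
    total = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += int(eval_dnf(clauses, s | {x})) - int(eval_dnf(clauses, s))
    return total

# Hypothetical lineage of a Boolean query: (r1 AND s1) OR (r1 AND s2).
lineage = [frozenset({"r1", "s1"}), frozenset({"r1", "s2"})]
facts = ["r1", "s1", "s2"]
values = {f: banzhaf(lineage, facts, f) for f in facts}
print(values)  # {'r1': 3, 's1': 1, 's2': 1}
```

In this toy lineage `r1` appears in every clause, so it receives the largest value; ranking facts by these numbers is the problem the abstract's dichotomy is about.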
-
ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning
Authors:
Junxiong Wang,
Immanuel Trummer,
Ahmet Kara,
Dan Olteanu
Abstract:
The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.
Submitted 31 July, 2023;
originally announced July 2023.
-
From Shapley Value to Model Counting and Back
Authors:
Ahmet Kara,
Dan Olteanu,
Dan Suciu
Abstract:
In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value.
Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation.
We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form.
Submitted 25 June, 2023;
originally announced June 2023.
-
F-IVM: Analytics over Relational Databases under Updates
Authors:
Ahmet Kara,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
This article describes F-IVM, a unified approach for maintaining analytics over changing relational data. We exemplify its versatility in four disciplines: processing queries with group-by aggregates and joins; learning linear regression models using the covariance matrix of the input features; building Chow-Liu trees using pairwise mutual information of the input features; and matrix chain multiplication.
F-IVM has three main ingredients: higher-order incremental view maintenance; factorized computation; and ring abstraction. F-IVM reduces the maintenance of a task to that of a hierarchy of simple views. Such views are functions mapping keys, which are tuples of input values, to payloads, which are elements from a ring. F-IVM also supports efficient factorized computation over keys, payloads, and updates. Finally, F-IVM treats uniformly seemingly disparate tasks. In the key space, all tasks require joins and variable marginalization. In the payload space, tasks differ in the definition of the sum and product ring operations.
We implemented F-IVM on top of DBToaster and show that it can outperform classical first-order and fully recursive higher-order incremental view maintenance by orders of magnitude while using less memory.
Submitted 29 January, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
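The idea in the entry above of views that map keys to ring payloads can be illustrated with a small Python sketch. This is not F-IVM's implementation or API; it is a simplified illustration that assumes views are plain dictionaries, uses the integer ring (so payloads are tuple multiplicities and the task is a join count), and introduces the hypothetical names `marginalize`, `join`, and `count_join`. An insert is modelled as a delta view with payload 1; a delete would use payload -1.

```python
from collections import defaultdict
from operator import add, mul

def marginalize(view, keep, ring_add, zero):
    """Sum out every key position not listed in `keep`, using ring addition."""
    out = defaultdict(lambda: zero)
    for key, payload in view.items():
        new_key = tuple(key[i] for i in keep)
        out[new_key] = ring_add(out[new_key], payload)
    return dict(out)

def join(left, right, ring_mul):
    """Join two views that share the same key schema; multiply their payloads."""
    return {k: ring_mul(p, right[k]) for k, p in left.items() if k in right}

def count_join(r, s):
    """Count of R(A,B) JOIN S(B,C) via a small hierarchy of views over B."""
    v_r = marginalize(r, [1], add, 0)   # view over B built from R
    v_s = marginalize(s, [0], add, 0)   # view over B built from S
    return marginalize(join(v_r, v_s, mul), [], add, 0).get((), 0)

# Hypothetical relations; each tuple has multiplicity 1.
R = {(1, 10): 1, (2, 10): 1, (3, 20): 1}
S = {(10, "x"): 1, (10, "y"): 1, (20, "z"): 1}
total = count_join(R, S)

# Single-tuple insert into R, processed as a delta through the same views.
delta_R = {(4, 20): 1}
new_total = total + count_join(delta_R, S)
print(total, new_total)  # 5 6
```

Because the sum and product operations are passed in as parameters, swapping in a different ring changes the task without changing the join and marginalization logic, which is the uniformity the abstract describes. A real maintenance scheme would store the intermediate views rather than rebuild the view over S for each delta, as this sketch does.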
-
Extracting Relations Between Sectors
Authors:
Atakan Kara,
F. Serhan Daniş,
Günce Keziban Orman,
Sultan Nezihe Turhan
Abstract:
The term "sector" in professional business life is a vague concept since companies tend to identify themselves as operating in multiple sectors simultaneously. This ambiguity poses problems in recommending jobs to job seekers or finding suitable candidates for open positions. The latter holds significant importance when available candidates in a specific sector are also scarce; hence, finding candidates from similar sectors becomes crucial. This work focuses on discovering possible sector similarities through relational analysis. We employ several algorithms from the frequent pattern mining and collaborative filtering domains, namely negFIN, Alternating Least Squares, Bilateral Variational Autoencoder, and Collaborative Filtering based on Pearson's Correlation, Kendall and Spearman's Rank Correlation coefficients. The algorithms are compared on a real-world dataset supplied by a major recruitment company, Kariyer.net, from Turkey. The insights and methods gained through this work are expected to increase the efficiency and accuracy of various methods, such as recommending jobs or finding suitable candidates for open positions.
Submitted 30 August, 2022;
originally announced August 2022.
-
Conjunctive Queries with Free Access Patterns under Updates
Authors:
Ahmet Kara,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
We study the problem of answering conjunctive queries with free access patterns (CQAP) under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables.
We introduce a fully dynamic evaluation approach for CQAP queries. We also give a syntactic characterisation of those CQAP queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given a tuple of values over the input variables. Finally, we chart the complexity trade-off between the preprocessing time, update time and enumeration delay for CQAP queries. For a class of CQAP queries, our approach achieves optimal, albeit non-constant, update time and delay. Their optimality is predicated on the Online Matrix-Vector Multiplication conjecture. Our results recover prior work on the dynamic evaluation of conjunctive queries without access patterns. We also illustrate an application of our dynamic evaluation approach to tractable CQAP queries over probabilistic databases.
Submitted 14 February, 2024; v1 submitted 17 June, 2022;
originally announced June 2022.
-
Efficient Online Bayesian Inference for Neural Bandits
Authors:
Gerardo Duran-Martin,
Aleyna Kara,
Kevin Murphy
Abstract:
In this paper we present a new algorithm for online (sequential) inference in Bayesian neural networks, and show its suitability for tackling contextual bandit problems. The key idea is to combine the extended Kalman filter (which locally linearizes the likelihood function at each time step) with a (learned or random) low-dimensional affine subspace for the parameters; the use of a subspace enables us to scale our algorithm to models with $\sim 1M$ parameters. While most other neural bandit methods need to store the entire past dataset in order to avoid the problem of "catastrophic forgetting", our approach uses constant memory. This is possible because we represent uncertainty about all the parameters in the model, not just the final linear layer. We show good results on the "Deep Bayesian Bandit Showdown" benchmark, as well as MNIST and a recommender system.
Submitted 30 November, 2021;
originally announced December 2021.
-
Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity
Authors:
Ali Devran Kara,
Naci Saldi,
Serdar Yüksel
Abstract:
Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs) (also called controlled Markov chains) and various efforts have been made in the literature towards the applicability of such algorithms for continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality with either explicit performance bounds or which are guaranteed to be asymptotically optimal. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near-optimality of finite state model approximations for MDPs with weakly continuous kernels which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.
Submitted 7 September, 2023; v1 submitted 12 November, 2021;
originally announced November 2021.
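The quantized Q-learning scheme named in the entry above can be sketched in a few lines of Python: a continuous state is mapped to a finite bin, and a standard tabular Q-learning update runs on the bins. This is only an illustration under assumptions not stated in the abstract: a uniform quantizer on a one-dimensional interval, a cost-minimization convention, and a step size `alpha` passed in as a plain parameter. The names `quantize` and `q_update` are hypothetical.

```python
def quantize(x, low, high, bins):
    """Map x in [low, high] to a bin index in 0..bins-1 (uniform quantizer)."""
    if not low <= x <= high:
        raise ValueError("state outside the quantized interval")
    index = int((x - low) / (high - low) * bins)
    return min(index, bins - 1)  # the right endpoint falls into the last bin

def q_update(q, s, a, cost, s_next, alpha, beta):
    """One tabular Q-learning step on quantized states (cost minimization)."""
    target = cost + beta * min(q[s_next])
    q[s][a] += alpha * (target - q[s][a])

# Four bins on [0, 1], two actions, all Q-values initially zero.
q = [[0.0, 0.0] for _ in range(4)]
q[3] = [2.0, 1.0]                      # pretend bin 3 was already visited
s = quantize(0.30, 0.0, 1.0, 4)        # bin 1
s_next = quantize(0.80, 0.0, 1.0, 4)   # bin 3
q_update(q, s, a=0, cost=1.0, s_next=s_next, alpha=0.5, beta=0.9)
print(s, s_next, q[s][0])
```

The learner only ever sees the bin index, which is why the abstract can view the quantizer as a measurement kernel and the quantized MDP as a POMDP.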
-
Machine Learning over Static and Dynamic Relational Data
Authors:
Ahmet Kara,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on the exploitation of the relational data structure to improve the runtime performance of the learning task.
The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
2.1) Turn the ML problem into a DB problem
2.2) Exploit structure of the data and problem
2.3) Exploit engineering tools of a DB researcher
3) Avenues for future research
Submitted 29 July, 2021;
originally announced July 2021.
-
Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability
Authors:
Ali Devran Kara,
Serdar Yuksel
Abstract:
In this paper, for POMDPs, we provide the convergence of a Q learning algorithm for control policies using a finite history of past observations and control actions, and, consequentially, we establish near optimality of such limit Q functions under explicit filter stability conditions. We present explicit error bounds relating the approximation error to the length of the finite history window. We establish the convergence of such Q-learning iterations under mild ergodicity assumptions on the state process during the exploration phase. We further show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained using the limit Q values compared to the performance of the optimal policy for the POMDP, where we also present explicit conditions using recent results on filter stability in controlled POMDPs. While there exist many experimental results, (i) the rigorous asymptotic convergence (to an approximate MDP value function) for such finite-memory Q-learning algorithms, and (ii) the near optimality with an explicit rate of convergence (in the memory size) are results that are new to the literature, to our knowledge.
Submitted 25 October, 2022; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes
Authors:
Ali Devran Kara,
Serdar Yuksel
Abstract:
In the theory of Partially Observed Markov Decision Processes (POMDPs), existence of optimal policies has in general been established via converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as the regularity conditions needed often require a tedious study involving the spaces of probability measures leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.
Submitted 8 January, 2022; v1 submitted 14 October, 2020;
originally announced October 2020.
-
F-IVM: Learning over Fast-Evolving Relational Data
Authors:
Milos Nikolic,
Haozhe Zhang,
Ahmet Kara,
Dan Olteanu
Abstract:
F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
Submitted 31 May, 2020;
originally announced June 2020.
-
Maintaining Triangle Queries under Updates
Authors:
Ahmet Kara,
Milos Nikolic,
Hung Q. Ngo,
Dan Olteanu,
Haozhe Zhang
Abstract:
We consider the problem of incrementally maintaining the triangle queries with arbitrary free variables under single-tuple updates to the input relations. We introduce an approach called IVM$^ε$ that exhibits a trade-off between the update time, the space, and the delay for the enumeration of the query result, such that the update time ranges from the square root to linear in the database size while the delay ranges from constant to linear time. IVM$^ε$ achieves Pareto worst-case optimality in the update-delay space conditioned on the Online Matrix-Vector Multiplication conjecture. It is strongly Pareto optimal for the triangle queries with zero or three free variables and weakly Pareto optimal for the triangle queries with one or two free variables.
Submitted 7 April, 2020;
originally announced April 2020.
-
Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries
Authors:
Ahmet Kara,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database.
Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration.
We define the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries.
We show that for a restricted class of hierarchical queries, our approach achieves worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
Submitted 8 August, 2023; v1 submitted 3 July, 2019;
originally announced July 2019.
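The heavy/light distinction in the entry above can be sketched as a degree-based partition of a relation. The Python snippet below is a simplified illustration, not the paper's procedure: it assumes a threshold of N^epsilon for a relation of size N, treats a value as heavy when its degree is at least the threshold, and ignores the rebalancing a dynamic scheme needs as N changes. The name `partition_by_degree` and the sample data are hypothetical.

```python
from collections import Counter

def partition_by_degree(relation, position, epsilon):
    """Split tuples by the degree of the value at `position`.

    Assumed rule: a value is heavy if it occurs in at least N**epsilon tuples,
    where N is the relation size; otherwise it is light.
    """
    threshold = len(relation) ** epsilon
    degree = Counter(t[position] for t in relation)
    heavy = [t for t in relation if degree[t[position]] >= threshold]
    light = [t for t in relation if degree[t[position]] < threshold]
    return heavy, light, threshold

# Hypothetical relation R(A, B) with 16 tuples: "a" is frequent, "b" and "c" are rare.
R = ([("a", i) for i in range(10)]
     + [("b", i) for i in range(3)]
     + [("c", i) for i in range(3)])
heavy, light, threshold = partition_by_degree(R, position=0, epsilon=0.5)
print(threshold, {t[0] for t in heavy}, {t[0] for t in light})
```

Under this rule at most N / N^epsilon distinct values can be heavy, while each light value has fewer than N^epsilon matching tuples; tuning epsilon moves work between preprocessing, updates, and enumeration, which is the trade-off the abstract parameterizes by the threshold.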
-
Incremental Techniques for Large-Scale Dynamic Query Processing
Authors:
Iman Elghandour,
Ahmet Kara,
Dan Olteanu,
Stijn Vansummeren
Abstract:
Many applications from various disciplines are now required to analyze fast evolving big data in real time. Various approaches for incremental processing of queries have been proposed over the years. Traditional approaches rely on updating the results of a query when updates are streamed rather than re-computing these queries, and therefore, higher execution performance is expected. However, they do not perform well for large databases that are updated at high frequencies. Therefore, new algorithms and approaches have been proposed in the literature to address these challenges by, for instance, reducing the complexity of processing updates. Moreover, many of these algorithms are now leveraging distributed streaming platforms such as Spark Streaming and Flink. In this tutorial, we briefly discuss legacy approaches for incremental query processing, and then give an overview of the new challenges introduced due to processing big data streams. We then discuss in detail the recently proposed algorithms that address some of these challenges. We emphasize the characteristics and algorithmic analysis of various proposed approaches and conclude by discussing future research directions.
Submitted 1 February, 2019;
originally announced February 2019.
-
Counting Triangles under Updates in Worst-Case Optimal Time
Authors:
Ahmet Kara,
Hung Q. Ngo,
Milos Nikolic,
Dan Olteanu,
Haozhe Zhang
Abstract:
We consider the problem of incrementally maintaining the triangle count query under single-tuple updates to the input relations. We introduce an approach that exhibits a space-time tradeoff such that the space-time product is quadratic in the size of the input database and the update time can be as low as the square root of this size. This lowest update time is worst-case optimal conditioned on the Online Matrix-Vector Multiplication conjecture. The classical and factorized incremental view maintenance approaches are recovered as special cases of our approach within the space-time tradeoff. In particular, they require linear-time update maintenance, which is suboptimal. Our approach also recovers the worst-case optimal time complexity for computing the triangle count in the non-incremental setting.
Submitted 25 March, 2019; v1 submitted 8 April, 2018;
originally announced April 2018.
-
Covers of Query Results
Authors:
Ahmet Kara,
Dan Olteanu
Abstract:
We introduce succinct lossless representations of query results called covers. They are subsets of the query results that correspond to minimal edge covers in the hypergraphs of these results.
We first study covers whose structures are given by fractional hypertree decompositions of join queries. For any decomposition of a query, we give asymptotically tight size bounds for the covers of the query result over that decomposition and show that such covers can be computed in worst-case optimal time up to a logarithmic factor in the database size. For acyclic join queries, we can compute covers compositionally using query plans with a new operator called cover-join. The tuples in the query result can be enumerated from any of its covers with linearithmic pre-computation time and constant delay.
We then generalize covers from joins to functional aggregate queries that express a host of computational problems such as aggregate-join queries, in-database optimization, matrix chain multiplication, and inference in probabilistic graphical models.
Submitted 10 January, 2018; v1 submitted 5 September, 2017;
originally announced September 2017.
-
Cyclic Association Rules Mining under Constraints
Authors:
Wafa Tebourski,
Wahiba Ben Abdessalem Karaa
Abstract:
Several researchers have explored the temporal aspect of association rules mining. In this paper, we focus on cyclic association rules, in order to discover correlations among items characterized by regular cyclic variation over time. The overview of the state of the art has revealed the drawbacks of the algorithms proposed in the literature, namely the excessive number of generated rules that do not meet the expert's expectations. To overcome these restrictions, we introduce an approach dedicated to generating cyclic association rules under constraints through a new method called Constraint-Based Cyclic Association Rules (CBCAR). The experiments carried out underline the usefulness and the performance of our new approach.
Submitted 18 September, 2012;
originally announced September 2012.
-
Feasible Automata for Two-Variable Logic with Successor on Data Words
Authors:
Ahmet Kara,
Thomas Schwentick,
Tony Tan
Abstract:
We introduce an automata model for data words, that is, words that carry at each position a symbol from a finite alphabet and a value from an unbounded data domain. The model is (semantically) a restriction of data automata, introduced by Bojanczyk et al. in 2006, and is therefore called weak data automata. It is strictly less expressive than data automata and its expressive power is incomparable with that of register automata. The expressive power of weak data automata corresponds exactly to existential monadic second order logic with successor +1 and data value equality \sim, EMSO2(+1,\sim). It follows from previous work by David et al. in 2010 that the nonemptiness problem for weak data automata can be decided in 2-NEXPTIME. Furthermore, we study weak Büchi automata on data omega-strings. They can be characterized by the extension of EMSO2(+1,\sim) with existential quantifiers for infinite sets. Finally, the same complexity bound for its nonemptiness problem is established by a nondeterministic polynomial time reduction to the nonemptiness problem of weak data automata.
Submitted 6 October, 2011;
originally announced October 2011.
-
Semantic annotation of requirements for automatic UML class diagram generation
Authors:
Soumaya Amdouni,
Wahiba Ben Abdessalem Karaa,
Sondes Bouabid
Abstract:
The increasing complexity of software engineering requires effective methods and tools to support requirements analysts' activities. While much of a company's knowledge can be found in text repositories, current content management systems have limited capabilities for structuring and interpreting documents. In this context, we propose a tool for transforming text documents describing users' requirements to a UML model. The presented tool uses Natural Language Processing (NLP) and semantic rules to generate a UML class diagram. The main contribution of our tool is to assist designers in the transition from a textual description of user requirements to their UML diagrams, based on GATE (General Architecture for Text Engineering), by formulating the rules needed to generate new semantic annotations.
Submitted 17 July, 2011;
originally announced July 2011.
-
Named Entity Recognition Using Web Document Corpus
Authors:
Wahiba Ben Abdessalem Karaa
Abstract:
This paper introduces a named entity recognition approach for textual corpora. A Named Entity (NE) can be a named location, person, organization, date, time, etc., characterized by instances. An NE is found in texts accompanied by contexts: words to the left or right of the NE. The work mainly aims at identifying contexts that induce the NE's nature. For instance, the occurrence of the word "President" in a text means that this word or context may be followed by the name of a president, as in President "Obama". Likewise, a word preceded by the string "footballer" suggests that it is the name of a footballer. NE recognition may be viewed as a classification method, where every word is assigned to an NE class according to its context. The aim of this study is then to identify and classify the contexts that are most relevant for recognizing an NE, namely those frequently found with the NE. A learning approach using a training corpus of web documents, constructed from learning examples, is then suggested. Frequency representations and modified tf-idf representations are used to calculate the context weights associated with context frequency, learning example frequency, and document frequency in the corpus.
Submitted 28 February, 2011;
originally announced February 2011.
-
Extending Büchi Automata with Constraints on Data Values
Authors:
Ahmet Kara,
Tony Tan
Abstract:
Recently, data trees and data words have received a considerable amount of attention in connection with XML reasoning and system verification. These are trees or words that, in addition to labels from a finite alphabet, carry data values from an infinite alphabet (data). In general, it is rather hard to obtain logics for data words and trees that are sufficiently expressive but still have reasonable complexity for the satisfiability problem. In this paper, we extend and study the notion of Büchi automata for omega-words with data. We prove that the emptiness problem for such an extension is decidable in elementary complexity. We then apply our result to show the decidability of two kinds of logics for omega-words with data: the two-variable fragment of first-order logic and some extensions of classical linear temporal logic for omega-words with data.
Submitted 23 June, 2012; v1 submitted 24 December, 2010;
originally announced December 2010.
-
Temporal Logics on Words with Multiple Data Values
Authors:
Ahmet Kara,
Thomas Schwentick,
Thomas Zeume
Abstract:
The paper proposes and studies temporal logics for attributed words, that is, data words with a (finite) set of (attribute,value)-pairs at each position. It considers a basic logic which is a semantical fragment of the logic $LTL^\downarrow_1$ of Demri and Lazic with operators for navigation into the future and the past. By reduction to the emptiness problem for data automata it is shown that this basic logic is decidable. Whereas the basic logic only allows navigation to positions where a fixed data value occurs, extensions are studied that also allow navigation to positions with different data values. Besides some undecidable results it is shown that the extension by a certain UNTIL-operator with an inequality target condition remains decidable.
Submitted 6 October, 2010;
originally announced October 2010.
-
On the Hybrid Extension of CTL and CTL+
Authors:
Ahmet Kara,
Martin Lange,
Thomas Schwentick,
Volker Weber
Abstract:
The paper studies the expressivity, relative succinctness and complexity of satisfiability for hybrid extensions of the branching-time logics CTL and CTL+ by variables. Previous complexity results show that only fragments with one variable have elementary complexity. It is shown that H1CTL+ and H1CTL, the hybrid extensions with one variable of CTL+ and CTL, respectively, are expressively equivalent, but H1CTL+ is exponentially more succinct than H1CTL. On the other hand, HCTL+, the hybrid extension of CTL with arbitrarily many variables, does not capture CTL*, as it cannot even express the simple CTL* property EGFp. The satisfiability problem for H1CTL+ is complete for triply exponential time; this remains true for quite weak fragments and quite strong extensions of the logic.
Submitted 14 June, 2009;
originally announced June 2009.