Search | arXiv e-print repository

Multivariate Bayesian models with flexible shared interactions for analyzing spatio-temporal patterns of rare cancers

Authors: Garazi Retegui, Jaione Etxeberria, María Dolores Ugarte

Abstract: Rare cancers affect millions of people worldwide each year. However, estimating incidence or mortality rates associated with rare cancers presents important difficulties and poses new statistical methodological challenges. In this paper, we expand the collection of multivariate spatio-temporal models by introducing adaptable shared interactions to enable a comprehensive analysis of both incidence and cancer mortality in rare cancer cases. These models allow the modulation of spatio-temporal interactions between incidence and mortality, allowing for changes in their relationship over time. The new models have been implemented in INLA using r-generic constructions. We conduct a simulation study to evaluate the performance of the new spatio-temporal models in terms of sensitivity and specificity. Results show that multivariate spatio-temporal models with flexible shared interaction outperform conventional multivariate spatio-temporal models with independent interactions. We use these models to analyze incidence and mortality data for pancreatic cancer and leukaemia among males across 142 administrative healthcare districts of Great Britain over a span of nine biennial periods (2002-2019).

Submitted 15 March, 2024; originally announced March 2024.

Comments: 39 pages, 12 figures

arXiv:2403.07554 [pdf, other]

An Adaptive Learning Approach to Multivariate Time Forecasting in Industrial Processes

Authors: Fernando Miguelez, Josu Doncel, Maria Dolores Ugarte

Abstract: Industrial processes generate a massive amount of monitoring data that can be exploited to uncover hidden time losses in the system, leading to enhanced accuracy of maintenance policies and, consequently, increasing the effectiveness of the equipment. In this work, we propose a method for one-step probabilistic multivariate forecasting of time variables based on a Hidden Markov Model with covariates (IO-HMM). These covariates account for the correlation of the predicted variables with their past values and additional process measurements by means of a discrete model and a continuous model. The probabilities of the former are updated using Bayesian principles, while the parameter estimates for the latter are recursively computed through an adaptive algorithm that also admits a Bayesian interpretation. This approach permits the integration of new samples into the estimation of unknown parameters, computationally improving the efficiency of the process. We evaluate the performance of the method using a real data set obtained from a company of a particular sector; however, it is a versatile technique applicable to any other data set. The results show a consistent improvement over a persistence model, which assumes that future values are the same as current values, and more importantly, over univariate versions of our model.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 19 pages, 6 figures
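
The persistence baseline mentioned in the abstract above (future values are assumed equal to current values) can be sketched in a few lines. This is only an illustration of the general idea, not the authors' evaluation code: the function names `persistence_forecast` and `mean_absolute_error` and the small bivariate series are hypothetical.

```python
from typing import List, Sequence


def persistence_forecast(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """One-step persistence forecast: the prediction for step t+1 is the observation at step t."""
    # The last observation has no following step to predict within the sample,
    # so forecasts are produced from every observation except the last.
    return [list(row) for row in series[:-1]]


def mean_absolute_error(actual: Sequence[Sequence[float]],
                        predicted: Sequence[Sequence[float]]) -> float:
    """Mean absolute error averaged over all time steps and all variables."""
    total = 0.0
    count = 0
    for actual_row, predicted_row in zip(actual, predicted):
        for a, p in zip(actual_row, predicted_row):
            total += abs(a - p)
            count += 1
    return total / count


# Hypothetical bivariate series (assumption: two time variables per step).
observations = [[10.0, 2.0], [12.0, 2.5], [11.0, 2.5], [13.0, 3.0]]
forecasts = persistence_forecast(observations)

# Forecasts made at steps 0..2 are compared with the observations at steps 1..3.
mae = mean_absolute_error(observations[1:], forecasts)
print(mae)
```

Any candidate forecasting model can be scored against the same `observations[1:]` targets, which gives a like-for-like comparison with this baseline.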

arXiv:2308.11260 [pdf, other]

doi 10.1016/j.spasta.2023.100804

A simplified spatial+ approach to mitigate spatial confounding in multivariate spatial areal models

Authors: A. Urdangarin, T. Goicoa, T. Kneib, M. D. Ugarte

Abstract: Spatial areal models encounter the well-known and challenging problem of spatial confounding. This issue makes it arduous to distinguish between the impacts of observed covariates and spatial random effects. Despite previous research and various proposed methods to tackle this problem, finding a definitive solution remains elusive. In this paper, we propose a simplified version of the spatial+ approach that involves dividing the covariate into two components. One component captures large-scale spatial dependence, while the other accounts for short-scale dependence. This approach eliminates the need to separately fit spatial models for the covariates. We apply this method to analyse two forms of crimes against women, namely rapes and dowry deaths, in Uttar Pradesh, India, exploring their relationship with socio-demographic covariates. To evaluate the performance of the new approach, we conduct extensive simulation studies under different spatial confounding scenarios. The results demonstrate that the proposed method provides reliable estimates of fixed effects and posterior correlations between different responses.

Submitted 5 January, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

Journal ref: Spatial Statistics (2024)

arXiv:2303.16549 [pdf, other]

doi 10.1002/bimj.202300096

A scalable approach for short-term disease forecasting in high spatial resolution areal data

Authors: E. Orozco-Acosta, A. Riebler, A. Adin, M. D. Ugarte

Abstract: Short-term disease forecasting at specific discrete spatial resolutions has become a high-impact decision-support tool in health planning. However, when the number of areas is very large, obtaining predictions can be computationally intensive or even unfeasible using standard spatio-temporal models. The purpose of this paper is to provide a method for short-term predictions in high-dimensional areal data based on a newly proposed "divide-and-conquer" approach. We assess the predictive performance of this method and other classical spatio-temporal models in a validation study that uses cancer mortality data for the 7907 municipalities of continental Spain. The new proposal outperforms traditional models in terms of mean absolute error, root mean square error and interval score when forecasting cancer mortality one, two and three years ahead. Models are implemented in a fully Bayesian framework using the well-known integrated nested Laplace approximation (INLA) estimation technique.

Submitted 29 March, 2023; originally announced March 2023.

Journal ref: Biometrical Journal (2023)
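
The three criteria named in the abstract above (mean absolute error, root mean square error and interval score) can be written out as a short sketch. The interval score here follows the standard definition for a central (1 - alpha) prediction interval; whether the paper uses exactly this form and averaging is an assumption, and the function names and toy numbers are hypothetical.

```python
import math
from typing import Sequence


def mean_absolute_error(y: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute difference between observations and point forecasts."""
    return sum(abs(obs - p) for obs, p in zip(y, pred)) / len(y)


def root_mean_square_error(y: Sequence[float], pred: Sequence[float]) -> float:
    """Square root of the average squared forecast error."""
    return math.sqrt(sum((obs - p) ** 2 for obs, p in zip(y, pred)) / len(y))


def interval_score(y: Sequence[float], lower: Sequence[float],
                   upper: Sequence[float], alpha: float) -> float:
    """Mean interval score for central (1 - alpha) prediction intervals.

    Each score is the interval width plus a penalty of (2 / alpha) times the
    distance by which the observation falls outside the interval.
    """
    total = 0.0
    for obs, lo, hi in zip(y, lower, upper):
        score = hi - lo
        if obs < lo:
            score += (2.0 / alpha) * (lo - obs)
        elif obs > hi:
            score += (2.0 / alpha) * (obs - hi)
        total += score
    return total / len(y)


# Hypothetical observed counts, point forecasts and 90% interval limits.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 30.0]
lower_limits = [8.0, 15.0, 32.0]
upper_limits = [14.0, 22.0, 40.0]

mae_value = mean_absolute_error(observed, predicted)
rmse_value = root_mean_square_error(observed, predicted)
is_value = interval_score(observed, lower_limits, upper_limits, alpha=0.1)
print(mae_value, rmse_value, is_value)
```

Lower values are better for all three measures; the interval score rewards narrow intervals but penalises those that miss the observation, as happens for the third area in the toy data.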

arXiv:2210.14849 [pdf, other]

doi 10.1007/s11222-023-10263-x

High-dimensional order-free multivariate spatial disease mapping

Authors: G. Vicente, A. Adin, T. Goicoa, M. D. Ugarte

Abstract: Despite the amount of research on disease mapping in recent years, the use of multivariate models for areal spatial data remains limited due to difficulties in implementation and computational burden. These problems are exacerbated when the number of small areas is very large. In this paper, we introduce an order-free multivariate scalable Bayesian modelling approach to smooth mortality (or incidence) risks of several diseases simultaneously. The proposal partitions the spatial domain into smaller subregions, fits multivariate models in each subdivision and obtains the posterior distribution of the relative risks across the entire spatial domain. The approach also provides posterior correlations among the spatial patterns of the diseases in each partition that are combined through a consensus Monte Carlo algorithm to obtain correlations for the whole study region. We implement the proposal using integrated nested Laplace approximations (INLA) in the R package bigDM and use it to jointly analyse colorectal, lung, and stomach cancer mortality data in Spanish municipalities. The new proposal permits the analysis of big data sets and provides better results than fitting a single multivariate model.

Submitted 26 October, 2022; originally announced October 2022.

Journal ref: Statistics and Computing (2023)

arXiv:2210.07046 [pdf, other]

doi 10.1007/s13163-022-00449-8

Evaluating recent methods to overcome spatial confounding

Authors: A. Urdangarin, T. Goicoa, M. D. Ugarte

Abstract: The concept of spatial confounding is closely connected to spatial regression, although no general definition has been established. A generally accepted idea of spatial confounding in spatial regression models is the change in fixed effects estimates that may occur when spatially correlated random effects collinear with the covariate are included in the model. Different methods have been proposed to alleviate spatial confounding in spatial linear regression models, but it is not clear if they provide correct fixed effects estimates. In this article, we consider some of those proposals to alleviate spatial confounding such as restricted regression, the spatial+ model, and transformed Gaussian Markov random fields. The objective is to determine which one provides the best estimates of the fixed effects. Dowry death data in Uttar Pradesh in 2001, stomach cancer incidence data in Slovenia in the period 1995-2001 and lip cancer incidence data in Scotland between the years 1975-1980 are analyzed. Several simulation studies are conducted to evaluate the performance of the methods in different scenarios of spatial confounding. Results reflect that the spatial+ method seems to provide fixed effects estimates closest to the true value.

Submitted 7 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

arXiv:2208.13504 [pdf, other]

doi 10.1007/s11222-024-10383-y

Large-scale unsupervised spatio-temporal semantic analysis of vast regions from satellite images sequences

Authors: Carlos Echegoyen, Aritz Pérez, Guzmán Santafé, Unai Pérez-Goya, María Dolores Ugarte

Abstract: Temporal sequences of satellite images constitute a highly valuable and abundant resource for analyzing regions of interest. However, the automatic acquisition of knowledge on a large scale is a challenging task due to different factors such as the lack of precise labeled data, the definition and variability of the terrain entities, or the inherent complexity of the images and their fusion. In this context, we present a fully unsupervised and general methodology to conduct spatio-temporal taxonomies of large regions from sequences of satellite images. Our approach relies on a combination of deep embeddings and time series clustering to capture the semantic properties of the ground and its evolution over time, providing a comprehensive understanding of the region of interest. The proposed method is enhanced by a novel procedure specifically devised to refine the embedding and exploit the underlying spatio-temporal patterns. We use this methodology to conduct an in-depth analysis of a 220 km$^2$ region in northern Spain in different settings. The results provide a broad and intuitive perspective of the land where large areas are connected in a compact and well-structured manner, mainly based on climatic, phytological, and hydrological factors.

Submitted 14 February, 2024; v1 submitted 29 August, 2022; originally announced August 2022.

Journal ref: Statistics and Computing, Volume 34, article number 71, (2024)

arXiv:2202.03938 [pdf, other]

A Unique Cardiac Electrophysiological 3D Model

Authors: Cristina Rueda, Alejandro Rodríguez-Collado, Itziar Fernández, Christian Canedo, María Dolores Ugarte, Yolanda Larriba

Abstract: Mathematical models of cardiac electrical activity are one of the most important tools for elucidating diagnostic information about the heart. Even though it is one of the major problems in biomedical research, an efficient mathematical formulation for this modelling has still not been found. In this paper, we present an outstanding mathematical model. It relies on a five-dipole representation of the cardiac electric source, each one associated with the well-known waves of the electrocardiogram signal. The mathematical formulation is simple enough to be easily parametrized and rich enough to provide realistic signals. Beyond the physical basis of the model, the parameters are physiologically interpretable as they characterize the wave shape, similar to what a physician would look for in signals, thus making them very useful in diagnosis. The model accurately reproduces the electrocardiogram and vectorcardiogram signals of any diseased or healthy heart, bringing together different systems in a single model. Furthermore, a novel algorithm accurately identifies the model parameters. This new discovery represents a revolution in electrocardiography research, solving one of the main problems in this field. It is especially useful for the automatic diagnosis of cardiovascular diseases, patient follow-up or decision-making on new therapies.

Submitted 27 January, 2022; originally announced February 2022.

arXiv:2201.08323 [pdf, other]

doi 10.1016/j.cmpb.2023.107403

Big problems in spatio-temporal disease mapping: methods and software

Authors: E. Orozco-Acosta, A. Adin, M. D. Ugarte

Abstract: Fitting spatio-temporal models for areal data is crucial in many fields such as cancer epidemiology. However, when data sets are very large, many issues arise. The main objective of this paper is to propose a general procedure to analyze high-dimensional spatio-temporal count data, with special emphasis on mortality/incidence relative risk estimation. We present a pragmatic and simple idea that permits fitting hierarchical spatio-temporal models when the number of small areas is very large. Model fitting is carried out using integrated nested Laplace approximations over a partition of the spatial domain. We also use parallel and distributed strategies to speed up computations in a setting where Bayesian model fitting is generally prohibitively time-consuming and even unfeasible. Using simulated and real data, we show that our method outperforms classical global models. We implement the methods and algorithms that we develop in the open-source R package bigDM, where specific vignettes have been included to facilitate the use of the methodology for non-expert users. Our scalable methodology proposal provides reliable risk estimates when fitting Bayesian hierarchical spatio-temporal models for high-dimensional data.

Submitted 11 October, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Journal ref: Computer Methods and Programs in Biomedicine (2023)

arXiv:2007.07724 [pdf, other]

doi 10.1016/j.spasta.2021.100496

Scalable Bayesian modeling for smoothing disease risks in large spatial data sets

Authors: E. Orozco-Acosta, A. Adin, M. D. Ugarte

Abstract: Several methods have been proposed in the spatial statistics literature for the analysis of big data sets in continuous domains. However, new methods for analyzing high-dimensional areal data are still scarce. Here, we propose a scalable Bayesian modeling approach for smoothing mortality (or incidence) risks in high-dimensional data, that is, when the number of small areas is very large. The method is implemented in the R add-on package bigDM. Model fitting and inference are based on the idea of "divide and conquer" and use integrated nested Laplace approximations and numerical integration. We analyze the proposal's empirical performance in a comprehensive simulation study that considers two model-free settings. Finally, the methodology is applied to analyze male colorectal cancer mortality in Spanish municipalities, showing its benefits with regard to the standard approach in terms of goodness of fit and computational time.

Submitted 15 July, 2020; originally announced July 2020.

Journal ref: Spatial Statistics (2021), 41, 100496

arXiv:2003.01946 [pdf, ps, other]

doi 10.1177/1471082X211015452

Alleviating confounding in spatio-temporal areal models with an application on crimes against women in India

Authors: A. Adin, T. Goicoa, J. S. Hodges, P. Schnell, M. D. Ugarte

Abstract: Assessing associations between a response of interest and a set of covariates in spatial areal models is the leitmotiv of ecological regression. However, the presence of spatially correlated random effects can mask or even bias estimates of such associations due to confounding effects if they are not carefully handled. Though potentially harmful, confounding issues have often been ignored in practice leading to wrong conclusions about the underlying associations between the response and the covariates. In spatio-temporal areal models, the temporal dimension may emerge as a new source of confounding, and the problem may be even worse. In this work, we propose two approaches to deal with confounding of fixed effects by spatial and temporal random effects, while obtaining good model predictions. In particular, restricted regression and an apparently -- though in fact not -- equivalent procedure using constraints are proposed within both fully Bayes and empirical Bayes approaches. The methods are compared in terms of fixed-effect estimates and model selection criteria. The techniques are used to assess the association between dowry deaths and certain socio-demographic covariates in the districts of Uttar Pradesh, India.

Submitted 7 April, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

Journal ref: Statistical Modelling 2021

arXiv:2002.01859 [pdf, other]

RGISTools: Downloading, Customizing, and Processing Time Series of Remote Sensing Data in R

Authors: U. Pérez-Goya, M. Montesino-SanMartin, A. F. Militino, M. D. Ugarte

Abstract: There is a large number of data archives and web services offering free access to multispectral satellite imagery. Images from multiple sources are increasingly combined to improve the spatio-temporal coverage of measurements while achieving more accurate results. Archives and web services differ in their protocols, formats, and data standards, which are barriers to combining datasets. Here, we present RGISTools, an R package to create time-series of multispectral satellite images from multiple platforms in a harmonized and standardized way. We first provide an overview of the package functionalities, namely downloading, customizing, and processing multispectral satellite imagery for a region and time period of interest as well as a recent statistical method for gap-filling and smoothing series of images, called interpolation of the mean anomalies. We further show the capabilities of the package through a case study that combines Landsat-8 and Sentinel-2 satellite optical imagery to estimate the level of a water reservoir in Northern Spain. We expect RGISTools to foster research on data fusion and spatio-temporal modelling using satellite images from multiple programs.

Submitted 5 February, 2020; originally announced February 2020.

Comments: 31 pages, 6 figures

arXiv:1905.09848 [pdf, other]

Conjunctive Queries with Theta Joins Under Updates

Authors: Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner

Abstract: Modern application domains such as Composite Event Recognition (CER) and real-time Analytics require the ability to dynamically refresh query results under high update rates. Traditional approaches to this problem are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults (to avoid the space overhead of materialization). Both techniques have recently been shown to be suboptimal: instead of materializing results and subresults, one can maintain a data structure that supports efficient maintenance under updates and can quickly enumerate the full query output, as well as the changes produced under single updates. Unfortunately, these data structures have been developed only for aggregate-join queries composed of equi-joins, limiting their applicability in domains such as CER where temporal joins are commonplace. In this paper, we present a new approach for dynamically evaluating queries with multi-way theta-joins under updates that is effective in avoiding both materialization and recomputation of results, while supporting a wide range of applications. To do this we generalize Dynamic Yannakakis, an algorithm for dynamically processing acyclic equi-join queries. In tandem, and of independent interest, we generalize the notions of acyclicity and free-connexity to arbitrary theta-joins and show how to compute corresponding join trees.
We instantiate our framework to the case where theta-joins are only composed of equalities and inequalities and experimentally compare our algorithm to state-of-the-art CER systems as well as incremental view maintenance engines. Our approach performs consistently better than the competitor systems with up to two orders of magnitude improvements in both time and memory consumption.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1808.05602 [pdf, other]

doi 10.3390/e20100793

An information-theoretic approach to self-organisation: Emergence of complex interdependencies in coupled dynamical systems

Authors: Fernando Rosas, Pedro A. M. Mediano, Martin Ugarte, Henrik J. Jensen

Abstract: Self-organisation lies at the core of fundamental but still unresolved scientific questions, and holds the promise of de-centralised paradigms crucial for future technological developments. While self-organising processes have been traditionally explained by the tendency of dynamical systems to evolve towards specific configurations, or attractors, we see self-organisation as a consequence of the interdependencies that those attractors induce. Building on this intuition, in this work we develop a theoretical framework for understanding and quantifying self-organisation based on coupled dynamical systems and multivariate information theory. We propose a metric of global structural strength that identifies when self-organisation appears, and a multi-layered decomposition that explains the emergent structure in terms of redundant and synergistic interdependencies. We illustrate our framework on elementary cellular automata, showing how it can detect and characterise the emergence of complex structures.

Submitted 14 April, 2019; v1 submitted 16 August, 2018; originally announced August 2018.

Comments: 25 pages, 4 figures

arXiv:1803.05277 [pdf, ps, other]

Constant delay algorithms for regular document spanners

Authors: Fernando Florenzano, Cristian Riveros, Martin Ugarte, Stijn Vansummeren, Domagoj Vrgoc

Abstract: Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner's output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.

Submitted 14 March, 2018; originally announced March 2018.

arXiv:1712.01063 [pdf, other]

A Second-Order Approach to Complex Event Recognition

Authors: Alejandro Grez, Cristian Riveros, Martin Ugarte, Stijn Vansummeren

Abstract: Complex Event Recognition (CER for short) refers to the activity of detecting patterns in streams of continuously arriving data. This field has been traditionally approached from a practical point of view, resulting in heterogeneous implementations with fundamentally different capabilities. The main reason behind this is that defining formal semantics for a CER language is not trivial: such languages usually combine first-order variables for joining and filtering events with regular operators like sequencing and Kleene closure. Moreover, their semantics usually focus only on the detection of complex events, leaving the concept of output mostly unattended. In this paper, we propose to unify the semantics and output of complex event recognition languages by using second-order objects. Specifically, we introduce a CER language called Second Order Complex Event Logic (SO-CEL for short), that uses second-order variables for managing and outputting sequences of events. This makes the definition of the main CER operators simple, allowing us to develop the first steps in understanding its expressive power. We start by comparing SO-CEL with a version that uses first-order variables called FO-CEL, showing that they are equivalent in expressive power when restricted to unary predicates but, surprisingly, incomparable in general. Nevertheless, we show that if we restrict to sets of binary predicates, then SO-CEL is strictly more expressive than FO-CEL.
Then, we introduce a natural computational model called Unary Complex Event Automata (UCEA) that provides a better understanding of SO-CEL. We show that, under unary predicates, SO-CEL captures the subclass of UCEA that satisfy the so-called *-property. Finally, we identify the operations that SO-CEL is lacking to capture UCEA and introduce a natural extension of the language that captures the complete class of UCEA under unary predicates.

Submitted 4 December, 2017; originally announced December 2017.

arXiv:1709.05369 [pdf, other]

Foundations of Complex Event Processing

Authors: Marco Bucchi, Alejandro Grez, Cristian Riveros, Martín Ugarte

Abstract: Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we study how to efficiently evaluate CEL for processing complex events in the case of unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. By gathering these results together, we propose a framework for efficiently evaluating CEL with unary filters. Finally, we show experimentally that this framework consistently outperforms the competition, and even over trivial queries can be orders of magnitude more efficient.

Submitted 30 August, 2018; v1 submitted 15 September, 2017; originally announced September 2017.

Comments: Conference version

arXiv:1405.6416 [pdf, other]

Discussion of "Single and Two-Stage Cross-Sectional and Time Series Benchmarking Procedures for SAE"

Authors: Rebecca C. Steorts, M. Delores Ugarte

Abstract: We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating monthly total unemployment using data from the U.S. Census Bureau. We discuss three topics: linearity and model misspecification, computational complexity and model comparisons, and some aspects of small area estimation in practice. More specifically, we pose the following questions to the authors, which they may wish to answer: How robust is their model to misspecification? Is it time to perhaps move away from linear models of the type considered by Battese et al. (1988) and Fay and Herriot (1979)? What is the asymptotic computational complexity and what comparisons can be made to other models? Should the benchmarking constraints be inherently fixed or should they be random?

Submitted 25 May, 2014; originally announced May 2014.

Comments: 6 pages, 1 figure

Showing 1–18 of 18 results for author: Ugarte, M