Search | arXiv e-print repository

PDFA Distillation via String Probability Queries

Authors: Robert Baumgartner, Sicco Verwer

Abstract: Probabilistic deterministic finite automata (PDFA) are discrete event systems modeling conditional probabilities over languages: Given an already seen sequence of tokens they return the probability of tokens of interest to appear next. These types of models have gained interest in the domain of explainable machine learning, where they are used as surrogate models for neural networks trained as language models. In this work we present an algorithm to distill PDFA from neural networks. Our algorithm is a derivative of the L# algorithm and capable of learning PDFA from a new type of query, in which the algorithm infers conditional probabilities from the probability of the queried string to occur. We show its effectiveness on a recent public dataset by distilling PDFA from a set of trained neural networks.
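The idea of inferring conditional probabilities from string probabilities can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a hypothetical oracle `string_prob` that returns the probability that a sequence starts with the given prefix, and recovers the next-token probability as the ratio P(prefix + token) / P(prefix). The names `conditional_prob` and `toy_string_prob` are illustrative only.

```python
from typing import Callable, Sequence, Tuple

Token = str
Oracle = Callable[[Tuple[Token, ...]], float]


def conditional_prob(string_prob: Oracle, prefix: Sequence[Token], token: Token) -> float:
    """Return P(token | prefix) as P(prefix + token) / P(prefix).

    Assumes string_prob returns prefix probabilities (an assumption of
    this sketch, not something stated in the abstract).
    """
    p_prefix = string_prob(tuple(prefix))
    if p_prefix == 0.0:
        raise ValueError("prefix has zero probability; conditional is undefined")
    return string_prob(tuple(prefix) + (token,)) / p_prefix


def toy_string_prob(prefix: Tuple[Token, ...]) -> float:
    """Hypothetical oracle: i.i.d. tokens, 'a' with prob 0.25, 'b' with prob 0.75."""
    p = 1.0
    for t in prefix:
        p *= 0.25 if t == "a" else 0.75
    return p
```

For the toy oracle, `conditional_prob(toy_string_prob, ("a", "b"), "a")` divides P(aba) by P(ab), which recovers the per-token probability of 'a'.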

Submitted 28 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: LearnAut 2024

arXiv:2406.07208 [pdf, other]

Database-assisted automata learning

Authors: Hielke Walinga, Robert Baumgartner, Sicco Verwer

Abstract: This paper presents DAALder (Database-Assisted Automata Learning, with Dutch suffix from leerder), a new algorithm for learning state machines, or automata, specifically deterministic finite-state automata (DFA). When learning state machines from log data originating from software systems, the large amount of log data can pose a challenge. Conventional state merging algorithms cannot efficiently deal with this, as they require a large amount of memory. To solve this, we utilized database technologies to efficiently query a big trace dataset and construct a state machine from it, as databases allow large amounts of data to be saved on disk while still being efficient to query. Building on research in both active learning and passive learning, the proposed algorithm is a combination of the two. It can quickly find a characteristic set of traces from a database using heuristics from a state merging algorithm. Experiments show that our algorithm has similar performance to conventional state merging algorithms on large datasets, but requires far less memory.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 8 pages body, 12 pages total, LearnAut 2024 Keywords: Active/Passive state machine learning, Incomplete Minimally Adequate Teacher

arXiv:2405.19260 [pdf, other]

Hilbert Space Diffusion in Systems with Approximate Symmetries

Authors: Rahel L. Baumgartner, Luca V. Delacrétaz, Pranjal Nayak, Julian Sonner

Abstract: Random matrix theory (RMT) universality is the defining property of quantum mechanical chaotic systems, and can be probed by observables like the spectral form factor (SFF). In this paper, we describe systematic deviations from RMT behaviour at intermediate time scales in systems with approximate symmetries. At early times, the symmetries allow us to organize the Hilbert space into approximately decoupled sectors, each of which contributes independently to the SFF. At late times, the SFF transitions into the final ramp of the fully mixed chaotic Hamiltonian. For approximate continuous symmetries, the transitional behaviour is governed by a universal process that we call Hilbert space diffusion. The diffusion constant corresponding to this process is related to the relaxation rate of the associated nearly conserved charge. By implementing a chaotic sigma model for Hilbert-space diffusion, we formulate an analytic theory of this process which agrees quantitatively with our numerical results for different examples.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 32 pages + appendices, 4 figures

arXiv:2402.03447 [pdf, other]

Challenges in Variable Importance Ranking Under Correlation

Authors: Annie Liang, Thomas Jemielita, Andy Liaw, Vladimir Svetnik, Lingkang Huang, Richard Baumgartner, Jason M. Klusowski

Abstract: Variable importance plays a pivotal role in interpretable machine learning as it helps measure the impact of factors on the output of the prediction model. Model agnostic methods based on the generation of "null" features via permutation (or related approaches) can be applied. Such analysis is often utilized in pharmaceutical applications due to its ability to interpret black-box models, including tree-based ensembles. A major challenge and significant confounder in variable importance estimation, however, is the presence of between-feature correlation. Recently, several adjustments to marginal permutation utilizing feature knockoffs were proposed to address this issue, such as the variable importance measure known as conditional predictive impact (CPI). Assessment and evaluation of such approaches is the focus of our work. We first present a comprehensive simulation study investigating the impact of feature correlation on the assessment of variable importance. We then theoretically prove the limitation that highly correlated features pose for the CPI through the knockoff construction. While we expect that there is always no correlation between knockoff variables and their corresponding predictor variables, we prove that the correlation increases linearly beyond a certain correlation threshold between the predictor variables. Our findings emphasize the absence of free lunch when dealing with high feature correlation, as well as the necessity of understanding the utility and limitations behind methods in variable importance estimation.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2309.01823 [pdf]

Multi-dimension unified Swin Transformer for 3D Lesion Segmentation in Multiple Anatomical Locations

Authors: Shaoyan Pan, Yiqiao Liu, Sarah Halek, Michal Tomaszewski, Shubing Wang, Richard Baumgartner, Jianda Yuan, Gregory Goldmacher, Antong Chen

Abstract: In oncology research, accurate 3D segmentation of lesions from CT scans is essential for the modeling of lesion growth kinetics. However, following the RECIST criteria, radiologists routinely only delineate each lesion on the axial slice showing the largest transverse area, and delineate a small number of lesions in 3D for research purposes. As a result, we have plenty of unlabeled 3D volumes and labeled 2D images, and scarce labeled 3D volumes, which makes training a deep-learning 3D segmentation model a challenging task. In this work, we propose a novel model, denoted a multi-dimension unified Swin transformer (MDU-ST), for 3D lesion segmentation. The MDU-ST consists of a Shifted-window transformer (Swin-transformer) encoder and a convolutional neural network (CNN) decoder, allowing it to adapt to 2D and 3D inputs and learn the corresponding semantic information in the same encoder. Based on this model, we introduce a three-stage framework: 1) leveraging a large amount of unlabeled 3D lesion volumes through self-supervised pretext tasks to learn the underlying pattern of lesion anatomy in the Swin-transformer encoder; 2) fine-tuning the Swin-transformer encoder to perform 2D lesion segmentation with 2D RECIST slices to learn slice-level segmentation information; 3) further fine-tuning the Swin-transformer encoder to perform 3D lesion segmentation with labeled 3D volumes.
The network's performance is evaluated by the Dice similarity coefficient (DSC) and Hausdorff distance (HD) using an internal 3D lesion dataset with 593 lesions extracted from multiple anatomical locations. The proposed MDU-ST demonstrates significant improvement over the competing models. The proposed method can be used to conduct automated 3D lesion segmentation to assist radiomics and tumor growth modeling studies. This paper has been accepted by the IEEE International Symposium on Biomedical Imaging (ISBI) 2023.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2208.10605 [pdf, other]

SoK: Explainable Machine Learning for Computer Security Applications

Authors: Azqa Nadeem, Daniël Vos, Clinton Cao, Luca Pajola, Simon Dieck, Robert Baumgartner, Sicco Verwer

Abstract: Explainable Artificial Intelligence (XAI) aims to improve the transparency of machine learning (ML) pipelines. We systematize the increasingly growing (but fragmented) microcosm of studies that develop and utilize XAI methods for defensive and offensive cybersecurity tasks. We identify 3 cybersecurity stakeholders, i.e., model users, designers, and adversaries, who utilize XAI for 4 distinct objectives within an ML pipeline, namely 1) XAI-enabled user assistance, 2) XAI-enabled model verification, 3) explanation verification & robustness, and 4) offensive use of explanations. Our analysis of the literature indicates that many of the XAI applications are designed with little understanding of how they might be integrated into analyst workflows -- user studies for explanation evaluation are conducted in only 14% of the cases. The security literature sometimes also fails to disentangle the role of the various stakeholders, e.g., by providing explanations to model users and designers while also exposing them to adversaries. Additionally, the role of model designers is particularly minimized in the security literature. To this end, we present an illustrative tutorial for model designers, demonstrating how XAI can help with model verification. We also discuss scenarios where interpretability by design may be a better alternative. The systematization and the tutorial enable us to challenge several assumptions, and present open problems that can help shape the future of XAI research within cybersecurity.

Submitted 3 March, 2023; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: 13 pages. Accepted at Euro S&P

arXiv:2207.01516 [pdf, other]

Learning state machines via efficient hashing of future traces

Authors: Robert Baumgartner, Sicco Verwer

Abstract: State machines are popular models to model and visualize discrete systems such as software systems, and to represent regular grammars. Most algorithms that passively learn state machines from data assume all the data to be available from the beginning and they load this data into memory. This makes it hard to apply them to continuously streaming data and results in large memory requirements when dealing with large datasets. In this paper we propose a method to learn state machines from data streams using the count-min-sketch data structure to reduce memory requirements. We apply state merging using the well-known red-blue-framework to reduce the search space. We implemented our approach in an established framework for learning state machines, and evaluated it on a well-known dataset to provide experimental data, showing the effectiveness of our approach with respect to quality of the results and run-time.
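For readers unfamiliar with the count-min-sketch data structure named in the abstract, a minimal generic sketch follows. It is not the authors' implementation: the class name, the default `width` and `depth`, and the choice of salted BLAKE2b hashes are assumptions made for illustration. Each item is hashed into one counter per row, and the estimate is the minimum over rows, so counts can be overestimated by collisions but never underestimated.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, hypothetical parameters)."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One independent-looking hash per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The minimum over rows is the least collision-inflated counter.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for trace in ["a b c", "a b c", "a b", "a b c"]:
    sketch.add(trace)
```

Memory use depends only on `width * depth`, not on the number of distinct items seen, which is the property that makes the structure attractive for streams.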

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.14255 [pdf, other]

Target alignment in truncated kernel ridge regression

Authors: Arash A. Amini, Richard Baumgartner, Dai Feng

Abstract: Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of the KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an over-aligned regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the alignment spectrum and the generalization performance of kernel methods, both in terms of rates and in finite samples.

Submitted 28 June, 2022; originally announced June 2022.

arXiv:2108.08752 [pdf, other]

A Framework for an Assessment of the Kernel-target Alignment in Tree Ensemble Kernel Learning

Authors: Dai Feng, Richard Baumgartner

Abstract: Kernels ensuing from tree ensembles such as random forest (RF) or gradient boosted trees (GBT), when used for kernel learning, have been shown to be competitive with their respective tree ensembles (particularly in higher dimensional scenarios). On the other hand, it has also been shown that performance of the kernel algorithms depends on the degree of the kernel-target alignment. However, the kernel-target alignment for kernel learning based on the tree ensembles has not been investigated and filling this gap is the main goal of our work. Using the eigenanalysis of the kernel matrix, we demonstrate that for continuous targets good performance of the tree-based kernel learning is associated with strong kernel-target alignment. Moreover, we show that well performing tree ensemble based kernels are characterized by strong target aligned components that are expressed through scalar products between the eigenvectors of the kernel matrix and the target. This suggests that when tree ensemble based kernel learning is successful, relevant information for the supervised problem is concentrated near a lower dimensional manifold spanned by the target aligned components. Persistence of the strong target aligned components in tree ensemble based kernels is further supported by sensitivity analysis via landmark learning. In addition to a comprehensive simulation study, we also provide experimental results from several real life data sets that are in line with the simulations.
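As a point of reference for the term kernel-target alignment, the sketch below computes the commonly used uncentered alignment score, the Frobenius inner product of the kernel matrix K with yy^T divided by the product of their Frobenius norms. Whether the paper uses this exact (uncentered) variant is an assumption; the function name and the toy data are illustrative only.

```python
import math
from typing import List, Sequence


def kernel_target_alignment(K: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """Uncentered alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F)."""
    n = len(y)
    # <K, yy^T>_F = sum_ij K_ij * y_i * y_j = y^T K y
    numerator = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    yy_norm = sum(v * v for v in y)  # ||yy^T||_F equals y^T y
    return numerator / (k_norm * yy_norm)


y = [1.0, -1.0, 1.0]
ideal_kernel: List[List[float]] = [[a * b for b in y] for a in y]  # K = yy^T
identity_kernel = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
```

The kernel built as yy^T is perfectly aligned with the target, while the identity kernel, which carries no information about y, scores strictly lower.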

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2106.14109 [pdf]

Parmsurv: a SAS Macro for Flexible Parametric Survival Analysis with Long-Term Predictions

Authors: Han Fu, Shahrul Mt-Isa, Richard Baumgartner, William Malbecq

Abstract: Health economic evaluations often require predictions of survival rates beyond the follow-up period. Parametric survival models can be more convenient for economic modelling than the Cox model. The generalized gamma (GG) and generalized F (GF) distributions are extensive families that contain almost all commonly used distributions with various hazard shapes and arbitrary complexity. In this study, we present a new SAS macro for implementing a wide variety of flexible parametric models including the GG and GF distributions and their special cases, as well as the Gompertz distribution. Proper custom distributions are also supported. Different from existing SAS procedures, this macro not only supports regression on the location parameter but also on ancillary parameters, which greatly increases model flexibility. In addition, the SAS macro supports weighted regression, stratified regression and robust inference. This study demonstrates with several examples how the SAS macro can be used for flexible survival modeling and extrapolation.

Submitted 12 July, 2022; v1 submitted 26 June, 2021; originally announced June 2021.

Comments: 15 pages, 1 figure, 10 tables, accepted by The Clinical Data Science Conference - PHUSE US Connect 2021

arXiv:2101.06206 [pdf]

doi 10.1016/j.colsurfa.2021.127191

Influence of PEG on the Clustering of Active Janus Colloids

Authors: Mohammed A. Kalil, Nicky R. Baumgartner, Marola W. Issa, Shawn D. Ryan, Christopher L. Wirth

Abstract: Micrometer scale colloidal particles that propel in a deterministic fashion in response to local environmental cues are useful analogs to self-propelling entities found in nature. Both natural and synthetic active colloidal systems are often near boundaries or are located in crowded environments. Herein, we describe experiments in which we measured the influence of hydrogen peroxide concentration and dispersed polyethylene glycol (PEG) on the clustering behavior of 5 micrometer catalytic active Janus particles at low concentration. We found the extent to which clustering occurred in ensembles of active Janus particles grew with hydrogen peroxide concentration in the absence of PEG. Once PEG was added, clustering was slightly enhanced at low PEG volume fractions, but was reduced at higher PEG volume fractions. The region in which clustering was mitigated at higher PEG volume fractions corresponded to the region in which propulsion was previously found to be quenched. Complementary agent based simulations showed that clustering grew with nominal speed. These data support the hypothesis that growth of living crystals is enhanced with increases in propulsion speed, but the addition of PEG will tend to mitigate cluster formation as a consequence of quenched propulsion at these conditions.

Submitted 15 January, 2021; originally announced January 2021.

arXiv:2012.10737 [pdf, other]

(Decision and regression) tree ensemble based kernels for regression and classification

Authors: Dai Feng, Richard Baumgartner

Abstract: Tree based ensembles such as Breiman's random forest (RF) and Gradient Boosted Trees (GBT) can be interpreted as implicit kernel generators, where the ensuing proximity matrix represents the data-driven tree ensemble kernel. Kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. Recently, it has been shown that the kernel interpretation is germane to other tree-based ensembles e.g. GBTs. However, practical utility of the links between kernels and the tree ensembles has not been widely explored and systematically evaluated. The focus of our work is the investigation of the interplay between kernel methods and the tree based ensembles including the RF and GBT. We elucidate the performance and properties of the RF and GBT based kernels in a comprehensive simulation study comprising continuous and binary targets. We show that for continuous targets, the RF/GBT kernels are competitive with their respective ensembles in higher dimensional scenarios, particularly in cases with a larger number of noisy features. For the binary target, the RF/GBT kernels and their respective ensembles exhibit comparable performance. We provide the results from real life data sets for regression and classification to show how these insights may be leveraged in practice. Overall, our results support the tree ensemble based kernels as a valuable addition to the practitioner's toolbox.
Finally, we discuss extensions of the tree ensemble based kernels for survival targets, interpretable prototype and landmarking classification and regression. We outline a future line of research for kernels furnished by Bayesian counterparts of the frequentist tree ensembles.

Submitted 19 December, 2020; originally announced December 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:2009.00089

arXiv:2009.00089 [pdf, other]

Random Forest (RF) Kernel for Regression, Classification and Survival

Authors: Dai Feng, Richard Baumgartner

Abstract: Breiman's random forest (RF) can be interpreted as an implicit kernel generator, where the ensuing proximity matrix represents the data-driven RF kernel. Kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. However, practical utility of the links between kernels and the RF has not been widely explored and systematically evaluated. The focus of our work is the investigation of the interplay between kernel methods and the RF. We elucidate the performance and properties of the data driven RF kernels used by regularized linear models in a comprehensive simulation study comprising continuous, binary and survival targets. We show that for continuous and survival targets, the RF kernels are competitive with RF in higher dimensional scenarios with a larger number of noisy features. For the binary target, the RF kernel and RF exhibit comparable performance. As the RF kernel asymptotically converges to the Laplace kernel, we included it in our evaluation. For most simulation setups, the RF and RF kernel outperformed the Laplace kernel. Nevertheless, in some cases the Laplace kernel was competitive, showing its potential value for applications. We also provide the results from real life data sets for regression, classification and survival to illustrate how these insights may be leveraged in practice. Finally, we discuss further extensions of the RF kernels in the context of interpretable prototype and landmarking classification, regression and survival.
We outline a future line of research for kernels furnished by Bayesian counterparts of the RF.

Submitted 31 August, 2020; originally announced September 2020.

arXiv:2003.02943 [pdf]

A deep learning-facilitated radiomics solution for the prediction of lung lesion shrinkage in non-small cell lung cancer trials

Authors: Antong Chen, Jennifer Saouaf, Bo Zhou, Randolph Crawford, Jianda Yuan, Junshui Ma, Richard Baumgartner, Shubing Wang, Gregory Goldmacher

Abstract: Herein we propose a deep learning-based approach for the prediction of lung lesion response based on radiomic features extracted from clinical CT scans of patients in non-small cell lung cancer trials. The approach starts with the classification of lung lesions from the set of primary and metastatic lesions at various anatomic locations. Focusing on the lung lesions, we perform automatic segmentation to extract their 3D volumes. Radiomic features are then extracted from the lesion on the pre-treatment scan and the first follow-up scan to predict which lesions will shrink at least 30% in diameter during treatment (either Pembrolizumab or combinations of chemotherapy and Pembrolizumab), which is defined as a partial response by the Response Evaluation Criteria In Solid Tumors (RECIST) guidelines. A 5-fold cross validation on the training set led to an AUC of 0.84 +/- 0.03, and the prediction on the testing dataset reached an AUC of 0.73 +/- 0.02 for the outcome of 30% diameter shrinkage.

Submitted 5 March, 2020; originally announced March 2020.

Comments: Accepted by International Symposium on Biomedical Imaging (ISBI) 2020

arXiv:1901.03990 [pdf, other]

Formation of three-dimensional auditory space

Authors: Piotr Majdak, Robert Baumgartner, Claudia Jenny

Abstract: Human listeners need to permanently interact with their three-dimensional (3-D) environment. To this end, they require efficient perceptual mechanisms to form a sufficiently accurate 3-D auditory space. In this chapter, we discuss the formation of the 3-D auditory space from various perspectives. The aim is to show the link between cognition, acoustics, neurophysiology, and psychophysics, when it comes to spatial hearing. First, we present recent cognitive concepts for creating internal models of the complex auditory environment. Second, we describe the acoustic signals available at our ears and discuss the spatial information they convey. Third, we look into neurophysiology, seeking the neural substrates of the 3-D auditory space. Finally, we elaborate on psychophysical spatial tasks and percepts that are possible just because of the formation of the auditory space.

Submitted 13 January, 2019; originally announced January 2019.

arXiv:1207.0246 [pdf, other]

doi 10.1016/j.knosys.2014.07.007

Web Data Extraction, Applications and Techniques: A Survey

Authors: Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner

Abstract: Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, and this offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Submitted 9 June, 2014; v1 submitted 1 July, 2012; originally announced July 2012.

Comments: Knowledge-based Systems

Journal ref: Knowledge-Based Systems, 70, 301-323. 2014

arXiv:1106.3967 [pdf, ps, other]

doi 10.1007/978-3-642-23954-0_26

Intelligent Self-Repairable Web Wrappers

Authors: Emilio Ferrara, Robert Baumgartner

Abstract: The amount of information available on the Web grows at an incredible high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated during the last years. On the one hand, reliable solutions should provide robust algorithms of Web data mining which could automatically face possible malfunctioning or failures. On the other, in literature there is a lack of solutions about the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctioning or acquisition of corrupted data could be caused, for example, by structural modifications of data sources brought by their owners. Nowadays, verification of data integrity and maintenance are mostly manually managed, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so called Web wrappers -- which can face possible malfunctioning caused by modifications of the structure of the data source, and can automatically repair themselves.

Submitted 20 June, 2011; originally announced June 2011.

Comments: 12 pages, 4 figures; Proceedings of the 12th International Conference of the Italian Association for Artificial Intelligence, 2011

Journal ref: Lecture Notes in Computer Science, 6934:274-285, 2011

arXiv:1103.1254 [pdf, other]

Design of Automatically Adaptable Web Wrappers

Authors: Emilio Ferrara, Robert Baumgartner

Abstract: Nowadays, the huge amount of information distributed through the Web motivates studying techniques to be adopted in order to extract relevant data in an efficient and reliable way. Both academia and enterprises developed several approaches of Web data extraction, for example using techniques of artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision of information extracted from Web pages, and, at the same time, have to prove robustness in order not to compromise quality and reliability of data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different version of a Web page, in order to handle modifications, avoiding the failure of data extraction tasks and ensuring reliability of information extracted. Our purpose is to evaluate performances, advantages and draw-backs of our novel system of automatic wrapper adaptation.

Submitted 7 March, 2011; originally announced March 2011.

Comments: 7 pages, 2 figures, In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011)

Journal ref: Proceedings of the 3rd International Conference on Agents and Artificial Intelligence, pp 211-216, 2011

arXiv:1103.1252 [pdf, other]

doi 10.1007/978-3-642-19618-8_3

Automatic Wrapper Adaptation by Tree Edit Distance Matching

Authors: Emilio Ferrara, Robert Baumgartner

Abstract: Information distributed through the Web keeps growing faster day by day, and for this reason, several techniques for extracting Web data have been suggested during last years. Often, extraction tasks are performed through so called wrappers, procedures extracting information from Web pages, e.g. implementing logic-based techniques. Many fields of application today require a strong degree of robustness of wrappers, in order not to compromise assets of information or reliability of data extracted. Unfortunately, wrappers may fail in the task of extracting data from a Web page, if its structure changes, sometimes even slightly, thus requiring the exploiting of new techniques to be automatically held so as to adapt the wrapper to the new structure of the page, in case of failure. In this work we present a novel approach of automatic wrapper adaptation based on the measurement of similarity of trees through improved tree edit distance matching techniques.

Submitted 7 March, 2011; originally announced March 2011.

Comments: 7 pages, 3 figures, In Proceedings of the 2nd International Workshop on Combinations of Intelligent Methods and Applications (CIMA 2010)

Journal ref: Combinations of Intelligent Methods and Applications Smart Innovation, Systems and Technologies Volume 8, 2011, pp 41-54

arXiv:0903.1880 [pdf, other]

doi 10.1109/TMI.2010.2044512

SMART: A statistical framework for optimal design matrix generation with application to fMRI

Authors: Gautam Pendse, Adam Schwarz, Richard Baumgartner, Alexandre Coimbra, David Borsook, Lino Becerra

Abstract: The general linear model (GLM) is a well established tool for analyzing functional magnetic resonance imaging (fMRI) data. Most fMRI analyses via GLM proceed in a massively univariate fashion where the same design matrix is used for analyzing data from each voxel. A major limitation of this approach is the locally varying nature of signals of interest as well as associated confounds. This local variability results in a potentially large bias and uncontrolled increase in variance for the contrast of interest. The main contributions of this paper are two fold (1) We develop a statistical framework called SMART that enables estimation of an optimal design matrix while explicitly controlling the bias variance decomposition over a set of potential design matrices and (2) We develop and validate a numerical algorithm for computing optimal design matrices for general fMRI data sets. The implications of this framework include the ability to match optimally the magnitude of underlying signals to their true magnitudes while also matching the "null" signals to zero size thereby optimizing both the sensitivity and specificity of signal detection. By enabling the capture of multiple profiles of interest using a single contrast (as opposed to an F-test) in a way that optimizes for both bias and variance enables the passing of first level parameter estimates and their variances to the higher level for group analysis which is not possible using F-tests.
We demonstrate the application of this approach to in vivo pharmacological fMRI data capturing the acute response to a drug infusion, to task-evoked, block design fMRI and to the estimation of a haemodynamic response function (HRF) response in event-related fMRI. Our framework is quite general and has potentially wide applicability to a variety of disciplines.

Submitted 11 March, 2009; originally announced March 2009.

Comments: 68 pages, 34 figures

Showing 1–20 of 20 results for author: Baumgartner, R