
Two-sample KS test with approxQuantile in Apache Spark

Authors: Bradley Eck, Duygu Kabakci-Zorlu, Amadou Ba

Abstract: The classical two-sample test of Kolmogorov-Smirnov (KS) is widely used to test whether empirical samples come from the same distribution. Even though most statistical packages provide an implementation, carrying out the test in big data settings can be challenging because it requires a full sort of the data. The popular Apache Spark system for big data processing provides a 1-sample KS test, but not the 2-sample version. Moreover, recent Spark versions provide the approxQuantile method for querying $ε$-approximate quantiles. We build on approxQuantile to propose a variation of the classical Kolmogorov-Smirnov two-sample test that constructs approximate cumulative distribution functions (CDF) from $ε$-approximate quantiles. We derive error bounds of the approximate CDF and show how to use this information to carry out KS tests. Pseudocode for the approach requires 15 executable lines. A Python implementation appears in the appendix.

Submitted 14 December, 2023; originally announced December 2023.

Comments: 6 pages, 3 figures, python code in appendix. To appear at IEEE Big Data 2023
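
The idea in the abstract above, building approximate CDFs from quantiles and comparing them, can be sketched as follows. This is a minimal illustration and not the authors' pseudocode or appendix implementation: it uses exact quantiles from numpy.quantile as a stand-in for Spark's approxQuantile, the function name ks_distance_from_quantiles and the parameter n_probs are hypothetical, and the error bounds derived in the paper are not reproduced.

```python
import numpy as np


def ks_distance_from_quantiles(x, y, n_probs=101):
    """Approximate the two-sample KS distance from quantile summaries.

    Assumption: numpy.quantile (exact) stands in for an approximate
    quantile query; continuous data without heavy ties is assumed so
    that the quantile values are increasing.
    """
    probs = np.linspace(0.0, 1.0, n_probs)
    qx = np.quantile(x, probs)  # quantile summary of the first sample
    qy = np.quantile(y, probs)  # quantile summary of the second sample

    # Evaluate both approximate CDFs on the union of quantile values.
    grid = np.union1d(qx, qy)
    fx = np.interp(grid, qx, probs, left=0.0, right=1.0)
    fy = np.interp(grid, qy, probs, left=0.0, right=1.0)

    # KS distance: largest vertical gap between the two CDFs.
    return float(np.max(np.abs(fx - fy)))


rng = np.random.default_rng(0)
same = ks_distance_from_quantiles(rng.normal(size=5000), rng.normal(size=5000))
shifted = ks_distance_from_quantiles(rng.normal(size=5000), rng.normal(loc=2.0, size=5000))
print(f"same distribution: {same:.3f}, shifted distribution: {shifted:.3f}")
```

Only the two quantile summaries are needed to compute the distance, so neither sample has to be fully sorted; how the summary error propagates into the test decision is the subject of the paper itself.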

arXiv:2304.08352

What Makes a Good Dataset for Symbol Description Reading?

Authors: Karol Lynch, Joern Ploennigs, Bradley Eck

Abstract: The usage of mathematical formulas as concise representations of a document's key ideas is common practice. Correctly interpreting these formulas, by identifying mathematical symbols and extracting their descriptions, is an important task in document understanding. This paper makes the following contributions to the mathematical identifier description reading (MIDR) task: (i) introduces the Math Formula Question Answering Dataset (MFQuAD) with $7508$ annotated identifier occurrences; (ii) describes novel variations of the noun phrase ranking approach for the MIDR task; (iii) reports experimental results for the SOTA noun phrase ranking approach and our novel variations of the approach, providing problem insights and a performance baseline; (iv) provides a position on the features that make an effective dataset for the MIDR task.

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2211.06239

A monitoring framework for deployed machine learning models with supply chain examples

Authors: Bradley Eck, Duygu Kabakci-Zorlu, Yan Chen, France Savard, Xiaowei Bao

Abstract: Actively monitoring machine learning models during production operations helps ensure prediction quality and detection and remediation of unexpected or undesired conditions. Monitoring models already deployed in big data environments brings the additional challenges of adding monitoring in parallel to the existing modelling workflow and controlling resource requirements. In this paper, we describe (1) a framework for monitoring machine learning models; and, (2) its implementation for a big data supply chain application. We use our implementation to study drift in model features, predictions, and performance on three real data sets. We compare hypothesis test and information theoretic approaches to drift detection in features and predictions using the Kolmogorov-Smirnov distance and Bhattacharyya coefficient. Results showed that model performance was stable over the evaluation period. Features and predictions showed statistically significant drifts; however, these drifts were not linked to changes in model performance during the time of our study.

Submitted 11 November, 2022; originally announced November 2022.

Comments: 8 pages, 9 figures, IEEE Big Data 2022
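
The two drift measures named in the abstract above can be illustrated with a short sketch. It is an assumption-laden example, not the framework described in the paper: the function names, the shared equal-width histogram binning, and the default of 20 bins are all hypothetical choices made for illustration.

```python
import numpy as np


def bhattacharyya_coefficient(ref, cur, bins=20):
    """Overlap of two samples' histograms: 1 = identical, 0 = disjoint.

    Assumption: both samples are binned on one shared equal-width grid.
    """
    lo = min(ref.min(), cur.min())
    hi = max(ref.max(), cur.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(ref, bins=edges)
    q, _ = np.histogram(cur, bins=edges)
    p = p / p.sum()  # normalise counts to probabilities
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))


def ks_distance(ref, cur):
    """Largest gap between the two empirical CDFs (exact, uses a sort)."""
    grid = np.sort(np.concatenate([ref, cur]))
    f_ref = np.searchsorted(np.sort(ref), grid, side="right") / ref.size
    f_cur = np.searchsorted(np.sort(cur), grid, side="right") / cur.size
    return float(np.max(np.abs(f_ref - f_cur)))


rng = np.random.default_rng(1)
reference = rng.normal(size=2000)          # e.g. a feature at training time
current = rng.normal(loc=0.5, size=2000)   # the same feature in production
print("Bhattacharyya coefficient:", bhattacharyya_coefficient(reference, current))
print("KS distance:", ks_distance(reference, current))
```

A low Bhattacharyya coefficient or a high KS distance both point to drift between the reference and current samples; thresholds for acting on either value are application-specific and are not given here.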

arXiv:2103.07248

Knowledge- and Data-driven Services for Energy Systems using Graph Neural Networks

Authors: Francesco Fusco, Bradley Eck, Robert Gormally, Mark Purcell, Seshu Tirupathi

Abstract: The transition away from carbon-based energy sources poses several challenges for the operation of electricity distribution systems. Increasing shares of distributed energy resources (e.g. renewable energy generators, electric vehicles) and internet-connected sensing and control devices (e.g. smart heating and cooling) require new tools to support accurate, data-driven decision making. Modelling the effect of such growing complexity in the electrical grid is possible in principle using state-of-the-art power-flow models. In practice, the detailed information needed for these physical simulations may be unknown or prohibitively expensive to obtain. Hence, data-driven approaches to power systems modelling, including feedforward neural networks and auto-encoders, have been studied to leverage the increasing availability of sensor data, but have seen limited practical adoption due to lack of transparency and inefficiencies on large-scale problems. Our work addresses this gap by proposing a data- and knowledge-driven probabilistic graphical model for energy systems based on the framework of graph neural networks (GNNs). The model can explicitly factor in domain knowledge, in the form of grid topology or physics constraints, thus resulting in sparser architectures and much smaller parameter dimensionality when compared with traditional machine-learning models with similar accuracy. Results obtained from a real-world smart-grid demonstration project show how the GNN was used to inform grid congestion predictions and market bidding services for a distribution system operator participating in an energy flexibility market.

Submitted 12 March, 2021; originally announced March 2021.

Comments: Accepted for publication in proceedings of IEEE Conference of Big Data 2020

arXiv:2003.12141

Scalable Deployment of AI Time-series Models for IoT

Authors: Bradley Eck, Francesco Fusco, Robert Gormally, Mark Purcell, Seshu Tirupathi

Abstract: IBM Research Castor, a cloud-native system for managing and deploying large numbers of AI time-series models in IoT applications, is described. Modelling code templates, in Python and R, following a typical machine-learning workflow are supported. A knowledge-based approach to managing model and time-series data allows the use of general semantic concepts for expressing feature engineering tasks. Model templates can be programmatically deployed against specific instances of semantic concepts, thus supporting model reuse and automated replication as the IoT application grows. Deployed models are automatically executed in parallel leveraging a serverless cloud computing framework. The complete history of trained model versions and rolling-horizon predictions is persisted, thus enabling full model lineage and traceability. Results from deployments in real-world smart-grid live forecasting applications are reported. Scalability of executing up to tens of thousands of AI modelling tasks is also evaluated.

Submitted 24 March, 2020; originally announced March 2020.

Journal ref: Workshop AI for Internet of Things, IJCAI 2019

arXiv:1909.10870

doi 10.1145/3307772.3330158

AI Modelling and Time-series Forecasting Systems for Trading Energy Flexibility in Distribution Grids

Authors: Bradley Eck, Francesco Fusco, Robert Gormally, Mark Purcell, Seshu Tirupathi

Abstract: We demonstrate progress on the deployment of two sets of technologies to support distribution grid operators integrating high shares of renewable energy sources, based on a market for trading local energy flexibilities. An artificial-intelligence (AI) grid modelling tool, based on probabilistic graphs, predicts congestions and estimates the amount and location of energy flexibility required to avoid such events. A scalable time-series forecasting system delivers large numbers of short-term predictions of distributed energy demand and generation. We discuss the deployment of the technologies at three trial demonstration sites across Europe, in the context of a research project carried out in a consortium with energy utilities, technology providers and research institutions.

Submitted 18 September, 2019; originally announced September 2019.

arXiv:1811.08566

Castor: Contextual IoT Time Series Data and Model Management at Scale

Authors: Bei Chen, Bradley Eck, Francesco Fusco, Robert Gormally, Mark Purcell, Mathieu Sinn, Seshu Tirupathi

Abstract: We demonstrate Castor, a cloud-based system for contextual IoT time series data and model management at scale. Castor is designed to assist Data Scientists in (a) exploring and retrieving all relevant time series and contextual information that is required for their predictive modelling tasks; (b) seamlessly storing and deploying their predictive models in a cloud production environment; (c) monitoring the performance of all predictive models in production and (semi-)automatically retraining them in case of performance deterioration. The main features of Castor are: (1) an efficient pipeline for ingesting IoT time series data in real time; (2) a scalable, hybrid data management service for both time series and contextual data; (3) a versatile semantic model for contextual information which can be easily adapted to different application domains; (4) an abstract framework for developing and storing predictive models in R or Python; (5) deployment services which automatically train and/or score predictive models upon user-defined conditions. We demonstrate Castor for a real-world Smart Grid use case and discuss how it can be adapted to other application domains such as Smart Buildings, Telecommunication, Retail or Manufacturing.

Submitted 8 February, 2019; v1 submitted 20 November, 2018; originally announced November 2018.

Comments: 6 pages, 6 figures, ICDM 2018

arXiv:1810.09354

Generation of Virtual Dual Energy Images from Standard Single-Shot Radiographs using Multi-scale and Conditional Adversarial Network

Authors: Bo Zhou, Xunyu Lin, Brendan Eck, Jun Hou, David L. Wilson

Abstract: Dual-energy (DE) chest radiographs provide greater diagnostic information than standard radiographs by separating the image into bone and soft tissue, revealing suspicious lesions which may otherwise be obstructed from view. However, acquisition of DE images requires two physical scans, necessitating specialized hardware and processing, and images are prone to motion artifacts. Generation of virtual DE images from standard, single-shot chest radiographs would expand the diagnostic value of standard radiographs without changing the acquisition procedure. We present a Multi-scale Conditional Adversarial Network (MCA-Net) which produces high-resolution virtual DE bone images from standard, single-shot chest radiographs. Our proposed MCA-Net is trained using the adversarial network so that it learns sharp details for the production of high-quality bone images. Then, the virtual DE soft tissue image is generated by processing the standard radiograph with the virtual bone image using a cross projection transformation. Experimental results from 210 patient DE chest radiographs demonstrated that the algorithm can produce high-quality virtual DE chest radiographs. Important structures were preserved, such as coronary calcium in bone images and lung lesions in soft tissue images. The average structure similarity index and the peak signal to noise ratio of the produced bone images in testing data were 96.4 and 41.5, which are significantly better than results from previous methods. Furthermore, the results of our clinical evaluation, performed on the publicly available dataset, indicate the clinical value of our algorithm. Thus, our algorithm can produce high-quality DE images that are potentially useful for radiologists, computer-aided diagnostics, and other diagnostic tasks.

Submitted 14 April, 2021; v1 submitted 22 October, 2018; originally announced October 2018.

Comments: 16 pages, 7 figures, accepted by Asian Conference on Computer Vision (2018 ACCV), code available at https://github.com/bbbbbbzhou/Virtual-Dual-Energy

Showing 1–8 of 8 results for author: Eck, B