Search | arXiv e-print repository

ClusterRadar: an Interactive Web-Tool for the Multi-Method Exploration of Spatial Clusters Over Time

Authors: Lee Mason, Blánaid Hicks, Jonas S. Almeida

Abstract: Spatial cluster analysis, the detection of localized patterns of similarity in geospatial data, has a wide range of applications for scientific discovery and practical decision making. One way to detect spatial clusters is by using local indicators of spatial association, such as Local Moran's I or Getis-Ord Gi*. However, different indicators tend to produce substantially different results due to their distinct operational characteristics. Choosing a suitable method or comparing results from multiple methods is a complex task. Furthermore, spatial clusters are dynamic and it is often useful to track their evolution over time, which adds an additional layer of complexity. ClusterRadar is a web-tool designed to address these analytical challenges. The tool allows users to easily perform spatial clustering and analyze the results in an interactive environment, uniquely prioritizing temporal analysis and the comparison of multiple methods. The tool's interactive dashboard presents several visualizations, each offering a distinct perspective of the temporal and methodological aspects of the spatial clustering results. ClusterRadar has several features designed to maximize its utility to a broad user-base, including support for various geospatial formats, and a fully in-browser execution environment to preserve the privacy of sensitive data. Feedback from a varied set of researchers suggests ClusterRadar's potential for enhancing the temporal analysis of spatial clusters.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Submitted to IEEE Vis 2024

arXiv:2310.01413 [pdf]

A multi-institutional pediatric dataset of clinical radiology MRIs by the Children's Brain Tumor Network

Authors: Ariana M. Familiar, Anahita Fathi Kazerooni, Hannah Anderson, Aliaksandr Lubneuski, Karthik Viswanathan, Rocky Breslow, Nastaran Khalili, Sina Bagheri, Debanjan Haldar, Meen Chul Kim, Sherjeel Arif, Rachel Madhogarhia, Thinh Q. Nguyen, Elizabeth A. Frenkel, Zeinab Helili, Jessica Harrison, Keyvan Farahani, Marius George Linguraru, Ulas Bagci, Yury Velichko, Jeffrey Stevens, Sarah Leary, Robert M. Lober, Stephani Campion, Amy A. Smith, et al. (15 additional authors not shown)

Abstract: Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, have significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which require large datasets. To address this unmet need, we provide a multi-institutional, large-scale pediatric dataset of 23,101 multi-parametric MRI exams acquired through routine care for 1,526 brain tumor patients, as part of the Children's Brain Tumor Network. This includes longitudinal MRIs across various cancer diagnoses, with associated patient-level clinical information, digital pathology slides, as well as tissue genotype and omics data. To facilitate downstream analysis, treatment-naïve images for 370 subjects were processed and released through the NCI Childhood Cancer Data Initiative via the Cancer Data Service. Through ongoing efforts to continuously build these imaging repositories, our aim is to accelerate discovery and translational AI models with real-world data, to ultimately empower precision medicine for children.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2302.14624 [pdf, other]

The 2022 NIST Language Recognition Evaluation

Authors: Yooyoung Lee, Craig Greenberg, Eliot Godard, Asad A. Butt, Elliot Singer, Trang Nguyen, Lisa Mason, Douglas Reynolds

Abstract: In 2022, the U.S. National Institute of Standards and Technology (NIST) conducted the latest Language Recognition Evaluation (LRE) in an ongoing series administered by NIST since 1996 to foster research in language recognition and to measure state-of-the-art technology. Similar to previous LREs, LRE22 focused on conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) data. LRE22 also introduced new evaluation features, such as an emphasis on African languages, including low resource languages, and a test set consisting of segments containing between 3s and 35s of speech randomly sampled and extracted from longer recordings. A total of 21 research organizations, forming 16 teams, participated in this 3-month-long evaluation and made a total of 65 valid system submissions to be evaluated. This paper presents an overview of LRE22 and an analysis of system performance over different evaluation conditions. The evaluation results suggest that Oromo and Tigrinya are easier to detect while Xhosa and Zulu are more challenging. A greater confusability is seen for some language pairs. When speech duration increased, system performance significantly increased up to a certain duration, after which diminishing returns were observed.

Submitted 28 February, 2023; originally announced February 2023.

Comments: 5 pages, 10 figures

arXiv:2301.03469 [pdf, other]

KIDS: kinematics-based (in)activity detection and segmentation in a sleep case study

Authors: Omar Elnaggar, Roselina Arelhi, Frans Coenen, Andrew Hopkinson, Lyndon Mason, Paolo Paoletti

Abstract: Sleep behaviour and in-bed movements contain rich information on the neurophysiological health of people, and have a direct link to the general well-being and quality of life. Standard clinical practices rely on polysomnography for sleep assessment; however, it is intrusive, performed in unfamiliar environments and requires trained personnel. Progress has been made on less invasive sensor technologies, such as actigraphy, but clinical validation raises concerns over their reliability and precision. Additionally, the field lacks a widely acceptable algorithm, with proposed approaches ranging from raw signal or feature thresholding to data-hungry classification models, many of which are unfamiliar to medical staff. This paper proposes an online Bayesian probabilistic framework for objective (in)activity detection and segmentation based on clinically meaningful joint kinematics, measured by a custom-made wearable sensor. Intuitive three-dimensional visualisations of kinematic timeseries were accomplished through dimension reduction based preprocessing, offering out-of-the-box framework explainability potentially useful for clinical monitoring and diagnosis. The proposed framework attained up to 99.2\% $F_1$-score and 0.96 Pearson's correlation coefficient in, respectively, the posture change detection and inactivity segmentation tasks. The work paves the way for a reliable home-based analysis of movements during sleep which would serve patient-centred longitudinal care plans.

Submitted 4 January, 2023; originally announced January 2023.

Comments: 18 pages, 8 figures, supplementary info included

arXiv:2208.00003 [pdf, other]

RangL: A Reinforcement Learning Competition Platform

Authors: Viktor Zobernig, Richard A. Saldanha, Jinke He, Erica van der Sar, Jasper van Doorn, Jia-Chen Hua, Lachlan R. Mason, Aleksander Czechowski, Drago Indjic, Tomasz Kosmala, Alessandro Zocca, Sandjai Bhulai, Jorge Montalvo Arvizu, Claude Klöckl, John Moriarty

Abstract: The RangL project hosted by The Alan Turing Institute aims to encourage the wider uptake of reinforcement learning by supporting competitions relating to real-world dynamic decision problems. This article describes the reusable code repository developed by the RangL team and deployed for the 2022 Pathways to Net Zero Challenge, supported by the UK Net Zero Technology Centre. The winning solutions to this particular Challenge seek to optimize the UK's energy transition policy to net zero carbon emissions by 2050. The RangL repository includes an OpenAI Gym reinforcement learning environment and code that supports both submission to, and evaluation in, a remote instance of the open source EvalAI platform as well as all winning learning agent strategies. The repository is an illustrative example of RangL's capability to provide a reusable structure for future challenges.

Submitted 28 July, 2022; originally announced August 2022.

Comments: Primarily documents the RangL competition platform in general and its 2022 competition "Pathways to Net Zero" in particular. 10 pages, 2 figures, 1 table. Comments welcome!

arXiv:2205.10778 [pdf, other]

Sleep Posture One-Shot Learning Framework Using Kinematic Data Augmentation: In-Silico and In-Vivo Case Studies

Authors: Omar Elnaggar, Frans Coenen, Andrew Hopkinson, Lyndon Mason, Paolo Paoletti

Abstract: Sleep posture is linked to several health conditions such as nocturnal cramps and more serious musculoskeletal issues. However, in-clinic sleep assessments are often limited to vital signs (e.g. brain waves). Wearable sensors with embedded inertial measurement units have been used for sleep posture classification; nonetheless, previous works consider only a few (commonly four) postures, which are inadequate for advanced clinical assessments. Moreover, posture learning algorithms typically require longitudinal data collection to function reliably, and often operate on raw inertial sensor readings unfamiliar to clinicians. This paper proposes a new framework for sleep posture classification based on a minimal set of joint angle measurements. The proposed framework is validated on a rich set of twelve postures in two experimental pipelines: computer animation to obtain synthetic postural data, and human participant pilot study using custom-made miniature wearable sensors. Through fusing raw geo-inertial sensor measurements to compute a filtered estimate of relative segment orientations across the wrist and ankle joints, the body posture can be characterised in a way comprehensible to medical experts. The proposed sleep posture learning framework offers plug-and-play posture classification by capitalising on a novel kinematic data augmentation method that requires only one training example per posture. Additionally, a new metric together with data visualisations are employed to extract meaningful insights from the postures dataset, demonstrate the added value of the data augmentation method, and explain the classification performance. The proposed framework attained promising overall accuracy as high as 100% on synthetic data and 92.7% on real data, on par with state-of-the-art data-hungry algorithms available in the literature.

Submitted 22 May, 2022; originally announced May 2022.

Comments: 27 pages, 15 figures

arXiv:2204.10242 [pdf, other]

The 2021 NIST Speaker Recognition Evaluation

Authors: Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, Douglas Reynolds

Abstract: The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996. It was the second large-scale multimodal speaker/person recognition evaluation organized by NIST (the first one being SRE19). Similar to SRE19, it featured two core evaluation tracks, namely audio and audio-visual, as well as an optional visual track. In addition to offering fixed and open training conditions, it also introduced new challenges for the community, thanks to a new multimodal (i.e., audio, video, and selfie images) and multilingual (i.e., with multilingual speakers) corpus, termed WeCanTalk, collected outside North America by the Linguistic Data Consortium (LDC). These challenges included: 1) trials (target and non-target) with enrollment and test segments originating from different domains (i.e., telephony versus video), and 2) trials (target and non-target) with enrollment and test segments spoken in different languages (i.e., cross-lingual trials). This paper presents an overview of SRE21 including the tasks, performance metric, data, evaluation protocol, results and system performance analyses. A total of 23 organizations (forming 15 teams) from academia and industry participated in SRE21 and submitted 158 valid system outputs. Evaluation results indicate: audio-visual fusion produces substantial gains in performance over audio-only or visual-only systems; top performing speaker and face recognition systems exhibited comparable performance under the matched domain conditions present in this evaluation; and, the use of complex neural network architectures (e.g., ResNet) along with angular losses with margin, data augmentation, as well as long duration fine-tuning contributed to notable performance improvements for the audio-only speaker recognition task.

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2204.10228 [pdf, other]

The NIST CTS Speaker Recognition Challenge

Authors: Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, Douglas Reynolds

Abstract: The US National Institute of Standards and Technology (NIST) has been conducting a second iteration of the CTS challenge since August 2020. The current iteration of the CTS Challenge is a leaderboard-style speaker recognition evaluation using telephony data extracted from the unexposed portions of the Call My Net 2 (CMN2) and Multi-Language Speech (MLS) corpora collected by the LDC. The CTS Challenge is currently organized in a similar manner to the SRE19 CTS Challenge, offering only an open training condition using two evaluation subsets, namely Progress and Test. Unlike in the SRE19 Challenge, no training or development set was initially released, and NIST has publicly released the leaderboards on both subsets for the CTS Challenge. Which subset (i.e., Progress or Test) a trial belongs to is unknown to challenge participants, and each system submission needs to contain outputs for all of the trials. The CTS Challenge has also served, and will continue to do so, as a prerequisite for entrance to the regular SREs (such as SRE21). Since August 2020, a total of 53 organizations (forming 33 teams) from academia and industry have participated in the CTS Challenge and submitted more than 4400 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for some primary conditions in the CTS Challenge. The CTS Challenge results thus far indicate remarkable improvements in performance due to 1) speaker embeddings extracted using large-scale and complex neural network architectures such as ResNets along with angular margin losses for speaker embedding extraction, 2) extensive data augmentation, 3) the use of large amounts of in-house proprietary data from a large number of labeled speakers, and 4) long-duration fine-tuning.

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2202.13778 [pdf, other]

Rule-based Evolutionary Bayesian Learning

Authors: Themistoklis Botsas, Lachlan R. Mason, Omar K. Matar, Indranil Pan

Abstract: In our previous work, we introduced the rule-based Bayesian Regression, a methodology that leverages two concepts: (i) Bayesian inference, for the general framework and uncertainty quantification and (ii) rule-based systems for the incorporation of expert knowledge and intuition. The resulting method creates a penalty equivalent to a common Bayesian prior, but it also includes information that typically would not be available within a standard Bayesian context. In this work, we extend the aforementioned methodology with grammatical evolution, a symbolic genetic programming technique that we utilise for automating the rules' derivation. Our motivation is that grammatical evolution can potentially detect patterns from the data with valuable information, equivalent to that of expert knowledge. We illustrate the use of the rule-based Evolutionary Bayesian learning technique by applying it to synthetic as well as real data, and examine the results in terms of point predictions and associated uncertainty.

Submitted 28 February, 2022; originally announced February 2022.

Comments: 16 pages, 22 figures

arXiv:2111.06223 [pdf, other]

Data-Centric Engineering: integrating simulation, machine learning and statistics. Challenges and Opportunities

Authors: Indranil Pan, Lachlan Mason, Omar Matar

Abstract: Recent advances in machine learning, coupled with low-cost computation, availability of cheap streaming sensors, data storage and cloud technologies, have led to widespread multi-disciplinary research activity with significant interest and investment from commercial stakeholders. Mechanistic models, based on physical equations, and purely data-driven statistical approaches represent two ends of the modelling spectrum. New hybrid, data-centric engineering approaches, leveraging the best of both worlds and integrating both simulations and data, are emerging as a powerful tool with a transformative impact on the physical disciplines. We review the key research trends and application scenarios in the emerging field of integrating simulations, machine learning, and statistics. We highlight the opportunities that such an integrated vision can unlock and outline the key challenges holding back its realisation. We also discuss the bottlenecks in the translational aspects of the field and the long-term upskilling requirements of the existing workforce and future university graduates.

Submitted 22 November, 2021; v1 submitted 7 November, 2021; originally announced November 2021.

Comments: 20 pages, 2 figures

arXiv:2103.11919 [pdf]

doi 10.1029/2021MS002550

Machine Learning Emulation of 3D Cloud Radiative Effects

Authors: David Meyer, Robin J. Hogan, Peter D. Dueben, Shannon L. Mason

Abstract: The treatment of cloud structure in numerical weather and climate models is often greatly simplified to make them computationally affordable. Here we propose to correct the European Centre for Medium-Range Weather Forecasts 1D radiation scheme ecRad for 3D cloud effects using computationally cheap neural networks. 3D cloud effects are learned as the difference between ecRad's fast 1D Tripleclouds solver that neglects them and its 3D SPARTACUS (SPeedy Algorithm for Radiative TrAnsfer through CloUd Sides) solver that includes them but is about five times more computationally expensive. With typical errors between 20 % and 30 % of the 3D signal, neural networks improve Tripleclouds' accuracy for about 1 % increase in runtime. Thus, rather than emulating the whole of SPARTACUS, we keep Tripleclouds unchanged for cloud-free parts of the atmosphere and 3D-correct it elsewhere. The focus on the comparably small 3D correction instead of the entire signal allows us to improve predictions significantly if we assume a similar signal-to-noise ratio for both.

Submitted 15 March, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: Published version

Journal ref: Meyer, D., Hogan, R. J., Dueben, P. D., & Mason, S. L. (2022). Machine Learning Emulation of 3D Cloud Radiative Effects. Journal of Advances in Modeling Earth Systems, 14(3)

arXiv:2008.00422 [pdf, other]

Rule-based Bayesian regression

Authors: Themistoklis Botsas, Lachlan R. Mason, Indranil Pan

Abstract: We introduce a novel rule-based approach for handling regression problems. The new methodology carries elements from two frameworks: (i) it provides information about the uncertainty of the parameters of interest using Bayesian inference, and (ii) it allows the incorporation of expert knowledge through rule-based systems. The blending of those two different frameworks can be particularly beneficial for various domains (e.g. engineering), where, even though the significance of uncertainty quantification motivates a Bayesian approach, there is no simple way to incorporate researcher intuition into the model. We validate our models by applying them to synthetic applications: a simple linear regression problem and two more complex structures based on partial differential equations. Finally, we review the advantages of our methodology, which include the simplicity of the implementation, the uncertainty reduction due to the added information and, on some occasions, the derivation of better point predictions, and we address limitations, mainly from the computational complexity perspective, such as the difficulty in choosing an appropriate algorithm and the added computational burden.

Submitted 8 October, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

arXiv:2007.12167 [pdf, other]

doi 10.1016/j.physd.2020.132797

Latent-space time evolution of non-intrusive reduced-order models using Gaussian process emulation

Authors: Romit Maulik, Themistoklis Botsas, Nesar Ramachandra, Lachlan Robert Mason, Indranil Pan

Abstract: Non-intrusive reduced-order models (ROMs) have recently generated considerable interest for constructing computationally efficient counterparts of nonlinear dynamical systems emerging from various domain sciences. They provide a low-dimensional emulation framework for systems that may be intrinsically high-dimensional. This is accomplished by utilizing a construction algorithm that is purely data-driven. It is no surprise, therefore, that the algorithmic advances of machine learning have led to non-intrusive ROMs with greater accuracy and computational gains. However, in bypassing the utilization of an equation-based evolution, it is often seen that the interpretability of the ROM framework suffers. This becomes more problematic when black-box deep learning methods are used which are notorious for lacking robustness outside the physical regime of the observed data. In this article, we propose the use of a novel latent-space interpolation algorithm based on Gaussian process regression. Notably, this reduced-order evolution of the system is parameterized by control parameters to allow for interpolation in space. The use of this procedure also allows for a continuous interpretation of time which allows for temporal interpolation. The latter aspect provides information, with quantified uncertainty, about full-state evolution at a finer resolution than that utilized for training the ROMs. We assess the viability of this algorithm for an advection-dominated system given by the inviscid shallow water equations.

Submitted 15 October, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

arXiv:2007.07276 [pdf, other]

Numerical simulation, clustering and prediction of multi-component polymer precipitation

Authors: Pavan Inguva, Lachlan Mason, Indranil Pan, Miselle Hengardi, Omar K. Matar

Abstract: Multi-component polymer systems are of interest in organic photovoltaic and drug delivery applications, among others where diverse morphologies influence performance. An improved understanding of morphology classification, driven by composition-informed prediction tools, will aid polymer engineering practice. We use a modified Cahn-Hilliard model to simulate polymer precipitation. Such physics-based models require high-performance computations that prevent rapid prototyping and iteration in engineering settings. To reduce the required computational costs, we apply machine learning techniques for clustering and consequent prediction of the simulated polymer blend images in conjunction with simulations. Integrating ML and simulations in such a manner reduces the number of simulations needed to map out the morphology of polymer blends as a function of input parameters and also generates a data set which can be used by others to this end. We explore dimensionality reduction, via principal component analysis and autoencoder techniques, and analyse the resulting morphology clusters. Supervised machine learning using Gaussian process classification was subsequently used to predict morphology clusters according to species molar fraction and interaction parameter inputs. Manual pattern clustering yielded the best results, but machine learning techniques were able to predict the morphology of polymer blends with $\geq$ 90 $\%$ accuracy.

Submitted 26 August, 2020; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: 18 pages, 10 figures, supporting info in anc, fixed typos and references

ACM Class: I.6.5; I.6.3; I.5.4

arXiv:2003.07701 [pdf]

doi 10.1017/dce.2020.8

Data-driven surrogate modelling and benchmarking for process equipment

Authors: Gabriel F. N. Gonçalves, Assen Batchvarov, Yuyi Liu, Yuxin Liu, Lachlan Mason, Indranil Pan, Omar K. Matar

Abstract: In chemical process engineering, surrogate models of complex systems are often necessary for tasks of domain exploration, sensitivity analysis of the design parameters, and optimization. A suite of computational fluid dynamics (CFD) simulations geared toward chemical process equipment modeling has been developed and validated with experimental results from the literature. Various regression-based active learning strategies are explored with these CFD simulators in-the-loop under the constraints of a limited function evaluation budget. Specifically, five different sampling strategies and five regression techniques are compared, considering a set of four test cases of industrial significance and varying complexity. Gaussian process regression was observed to have a consistently good performance for these applications. The present quantitative study outlines the pros and cons of the different available techniques and highlights the best practices for their adoption. The test cases and tools are available with an open-source license to ensure reproducibility and engage the wider research community in contributing to both the CFD models and developing and benchmarking new improved algorithms tailored to this field.

Submitted 8 September, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

Journal ref: Data-Centric Engineering (2020), 1, E7

arXiv:cs/0007026 [pdf, ps]

Integrating E-Commerce and Data Mining: Architecture and Challenges

Authors: Suhail Ansari, Ron Kohavi, Llew Mason, Zijian Zheng

Abstract: We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges.

Submitted 13 July, 2000; originally announced July 2000.

Comments: KDD workshop: WebKDD 2000

ACM Class: I.2.6; H.2.8

Journal ref: WEBKDD'2000 workshop: Web Mining for E-Commerce -- Challenges and Opportunities

Showing 1–16 of 16 results for author: Mason, L