Search | arXiv e-print repository

Using iterated local alignment to aggregate GPS trajectories into a traffic flow map

Abstract: Desire line maps are widely deployed for traffic flow analysis by virtue of their ease of interpretation and computation. They can be considered to be simplified traffic flow maps, whereas the computational challenges in aggregating small scale traffic flows prevent the wider dissemination of high resolution flow maps. GPS trajectories are a promising data source to solve this challenging problem. The solution begins with the alignment (or map matching) of the GPS trajectories to the road network. However even the state-of-the-art map matching APIs produce sub-optimal results with small misalignments. While these misalignments are negligible for large scale flow aggregation in desire line maps, they pose substantial obstacles for small scale flow aggregation in high resolution maps. To remove these remaining misalignments, we introduce innovative local alignment algorithms, where we infer road segments to serve as local reference segments, and proceed to align nearby road segments to them. With each local alignment iteration, the misalignments of the GPS trajectories with each other and with the road network are reduced, and so converge closer to a minimal flow map. By analysing a set of empirical GPS trajectories collected in Hannover, Germany, we confirm that our minimal flow map has high levels of spatial resolution, accuracy and coverage.

Submitted 25 June, 2024; originally announced June 2024.

MSC Class: 62P30

arXiv:2203.01686 [pdf, other]

Statistical visualisation for tidy and geospatial data in R via kernel smoothing methods in the eks package

Authors: Tarn Duong

Abstract: Kernel smoothers are essential tools for data analysis due to their ability to convey complex statistical information with concise graphical visualisations. Their inclusion in the base distribution and in the many user-contributed add-on packages of the R statistical analysis environment caters well to many practitioners. However, there remain some important gaps for specialised data types, most notably for tibbles (tidy data) within the tidyverse, and for simple features (geospatial data) within geospatial analysis. The proposed eks package fills in these gaps. In addition to kernel density estimation, this package also caters for more complex data analysis situations, such as density derivative estimation, density-based classification (supervised learning) and mean shift clustering (unsupervised learning). We illustrate with experimental data how to obtain and to interpret the statistical visualisations for these kernel smoothing methods.

Submitted 24 March, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

Comments: 19 pages, 10 figures

MSC Class: 62G07; 62G10; 62H12

arXiv:2202.13001 [pdf, other]

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

Authors: MohammadJavad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, Mohammad Ghavamzadeh

Abstract: We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of large number of tasks and small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M\tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M\tau}+N^{1/2}\sqrt{MK\tau})$ regret.

Submitted 18 October, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2105.12898 [pdf, other]

Stochastic Intervention for Causal Effect Estimation

Authors: Tri Dung Duong, Qian Li, Guandong Xu

Abstract: Causal inference methods are widely applied in various decision-making domains such as precision medicine, optimal policy and economics. Central to these applications is the treatment effect estimation of intervention strategies. Current estimation methods are mostly restricted to deterministic treatments and so are unable to address stochastic treatment policies. Moreover, previous methods can only make binary yes-or-no decisions based on the treatment effect, lacking the capability of providing a fine-grained estimate of the effect to explain the process of decision making. In our study, we therefore advance the causal inference research to estimate the stochastic intervention effect by devising a new stochastic propensity score and stochastic intervention effect estimator (SIE). Meanwhile, we design a customized genetic algorithm specific to the stochastic intervention effect (Ge-SIO) with the aim of providing causal evidence for decision making. We provide the theoretical analysis and conduct an empirical study to justify that our proposed measures and algorithms can achieve a significant performance lift in comparison with state-of-the-art baselines.

Submitted 26 May, 2021; originally announced May 2021.

Comments: Accepted in IJCNN 21

arXiv:2102.06381 [pdf, other]

doi 10.1016/j.treng.2021.100061

Relaxing door-to-door matching reduces passenger waiting times: a workflow for the analysis of driver GPS traces in a stochastic carpooling service

Authors: Panayotis Papoutsis, Safa Fennia, Constant Bridon, Tarn Duong

Abstract: Carpooling has the potential to transform itself into a mass transportation mode by abandoning its adherence to deterministic passenger-driver matching for door-to-door journeys, and by adopting instead stochastic matching on a network of fixed meeting points. Stochastic matching is where a passenger sends out a carpooling request at a meeting point, and then waits for the arrival of a self-selected driver who is already travelling to the requested meeting point. Crucially, there is no centrally dispatched driver. Moreover, the carpooling is assured only between the meeting points, so the onus is on the passengers to travel to/from them by their own means. Thus the success of a stochastic carpooling service relies on the convergence, with minimal perturbation to their existing travel patterns, to the meeting points which are highly frequented by both passengers and drivers. Due to the innovative nature of stochastic carpooling, existing off-the-shelf workflows are largely insufficient for this purpose. To fill the gap in the market, we introduce a novel workflow, comprising a combination of data science and GIS (Geographic Information Systems), to analyse driver GPS traces. We implement it for an operational stochastic carpooling service in south-eastern France, and we demonstrate that relaxing door-to-door matching reduces passenger waiting times. Our workflow provides additional key operational indicators, namely the driver flow maps, the driver flow temporal profiles and the driver participation rates.

Submitted 12 February, 2021; originally announced February 2021.

arXiv:2007.08962 [pdf, other]

doi 10.1080/02664763.2022.2026896

Bayesian hierarchical models for the prediction of the driver flow and passenger waiting times in a stochastic carpooling service

Authors: Panayotis Papoutsis, Bertrand Michel, Anne Philippe, Tarn Duong

Abstract: Carpooling is an integral component in smart carbon-neutral cities, in particular to facilitate home-work commuting. We study an innovative carpooling service developed by the start-up Ecov which specialises in home-work commutes in peri-urban and rural regions. When a passenger makes a carpooling request, a designated driver is not assigned as in a traditional carpooling service; rather the passenger waits for the first driver, from a population of non-professional drivers who are already en route, to arrive. We propose a two-stage Bayesian hierarchical model to overcome the considerable difficulties, due to the sparsely observed driver and passenger data from an embryonic stochastic carpooling service, to deliver high-quality predictions of driver flow and passenger waiting times. The first stage focuses on the driver flow, whose predictions are aggregated at the daily level to compensate for the data sparsity. The second stage processes this single daily driver flow into sub-daily (e.g. hourly) predictions of the passenger waiting times. We demonstrate that our model mostly outperforms frequentist and non-hierarchical Bayesian methods for observed data from an operational carpooling service in Lyon, France, and we also validate our model on simulated data.

Submitted 17 July, 2020; originally announced July 2020.

arXiv:2006.16789 [pdf, other]

Causality Learning: A New Perspective for Interpretable Machine Learning

Authors: Guandong Xu, Tri Dung Duong, Qian Li, Shaowu Liu, Xianzhi Wang

Abstract: Recent years have witnessed the rapid growth of machine learning in a wide range of fields such as image recognition, text classification, credit scoring prediction, recommendation systems, etc. In spite of their great performance in different sectors, researchers are still concerned about the mechanisms underlying machine learning (ML) techniques, which are inherently black-box and are becoming more complex to achieve higher accuracy. Therefore, interpreting machine learning models is currently a mainstream topic in the research community. However, traditional interpretable machine learning focuses on association instead of causality. This paper provides an overview of causal analysis with the fundamental background and key concepts, and then summarizes the most recent causal approaches for interpretable machine learning. The evaluation techniques for assessing method quality, and open problems in causal interpretability are also discussed in this paper.

Submitted 17 September, 2021; v1 submitted 27 June, 2020; originally announced June 2020.

Comments: 8 Pages

arXiv:2005.11856 [pdf, other]

Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning

Authors: Joseph Paul Cohen, Lan Dao, Paul Morrison, Karsten Roth, Yoshua Bengio, Beiyi Shen, Almas Abbasi, Mahsa Hoshmand-Kochi, Marzyeh Ghassemi, Haifang Li, Tim Q Duong

Abstract: Purpose: The need to streamline patient management for COVID-19 has become more pressing than ever. Chest X-rays provide a non-invasive (potentially bedside) tool to monitor the progression of the disease. In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images. Such a tool can gauge severity of COVID-19 lung infections (and pneumonia in general) that can be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU. Methods: Images from a public COVID-19 database were scored retrospectively by three blinded experts in terms of the extent of lung involvement as well as the degree of opacity. A neural network model that was pre-trained on large (non-COVID-19) chest X-ray datasets is used to construct features for COVID-19 images which are predictive for our task. Results: This study finds that training a regression model on a subset of the outputs from this pre-trained chest X-ray model predicts our geographic extent score (range 0-8) with 1.14 mean absolute error (MAE) and our lung opacity score (range 0-6) with 0.78 MAE. Conclusions: These results indicate that our model's ability to gauge severity of COVID-19 lung infections could be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the intensive care unit (ICU). A proper clinical trial is needed to evaluate efficacy. To enable this we make our code, labels, and data available online at https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-severity and https://github.com/ieee8023/covid-chestxray-dataset

Submitted 30 June, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

arXiv:1911.08795 [pdf]

On Node Features for Graph Neural Networks

Authors: Chi Thang Duong, Thanh Dat Hoang, Ha The Hien Dang, Quoc Viet Hung Nguyen, Karl Aberer

Abstract: A graph neural network (GNN) is a deep model for graph representation learning. One advantage of graph neural networks is their ability to incorporate node features into the learning process. However, this prevents graph neural networks from being applied to featureless graphs. In this paper, we first analyze the effects of node features on the performance of graph neural networks. We show that GNNs work well if there is a strong correlation between node features and node labels. Based on these results, we propose new feature initialization methods that allow graph neural networks to be applied to non-attributed graphs. Our experimental results show that the artificial features are highly competitive with real features.

Submitted 20 November, 2019; originally announced November 2019.

arXiv:1909.02977 [pdf, other]

Parallel Computation of Graph Embeddings

Authors: Chi Thang Duong, Hongzhi Yin, Thanh Dat Hoang, Truong Giang Le Ba, Matthias Weidlich, Quoc Viet Hung Nguyen, Karl Aberer

Abstract: Graph embedding aims at learning a vector-based representation of vertices that incorporates the structure of the graph. This representation then enables inference of graph properties. Existing graph embedding techniques, however, do not scale well to large graphs. We therefore propose a framework for parallel computation of a graph embedding using a cluster of compute nodes with resource constraints. We show how to distribute any existing embedding technique by first splitting a graph for any given set of constrained compute nodes and then reconciling the embedding spaces derived for these subgraphs. We also propose a new way to evaluate the quality of graph embeddings that is independent of a specific inference task. Based thereon, we give a formal bound on the difference between the embeddings derived by centralised and parallel computation. Experimental results illustrate that our approach for parallel computation scales well, while largely maintaining the embedding quality.

Submitted 6 September, 2019; originally announced September 2019.

arXiv:1902.04181 [pdf, other]

doi 10.1007/978-3-030-86383-8_8

Nearest Neighbor Median Shift Clustering for Binary Data

Authors: Gaël Beck, Tarn Duong, Mustapha Lebbah, Hanane Azzag

Abstract: We describe in this paper the theory and practice behind a new modal clustering method for binary data. Our approach (BinNNMS) is based on the nearest neighbor median shift. The median shift is an extension of the well-known mean shift, which was designed for continuous data, to handle binary data. We demonstrate with theoretical and experimental analyses that BinNNMS can accurately discover the location of clusters in binary data.

Submitted 11 February, 2019; originally announced February 2019.

Comments: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Ever

arXiv:1902.03833 [pdf, other]

doi 10.1016/j.jpdc.2019.07.015

A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering

Authors: Gaël Beck, Tarn Duong, Mustapha Lebbah, Hanane Azzag, Christophe Cérin

Abstract: In this paper we target the class of modal clustering methods where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is the k-means clustering. Mean Shift clustering is a generalization of the k-means clustering which computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with a linear time complexity, as opposed to the quadratic time complexity for the exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present our proposed scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.

Submitted 11 February, 2019; originally announced February 2019.

Comments: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Ever

arXiv:1806.05769 [pdf, other]

Bayesian Uncertainty Quantification and Information Fusion in CALPHAD-based Thermodynamic Modeling

Authors: Pejman Honarmandi, Thien Chi Duong, Seyede Fatemeh Ghoreishi, Douglas Allaire, Raymundo Arroyave

Abstract: Calculation of phase diagrams is one of the fundamental tools in alloy design, more specifically under the framework of Integrated Computational Materials Engineering. Uncertainty quantification of phase diagrams is the first step required to provide confidence for decision making in property- or performance-based design. By way of illustration, a thorough probabilistic assessment of the CALPHAD model parameters is performed against the available data for a Hf-Si binary case study using a Markov Chain Monte Carlo sampling approach. The plausible optimum values and uncertainties of the parameters are thus obtained, which can be propagated to the resulting phase diagram. Using the parameter values obtained from deterministic optimization in a computational thermodynamic assessment tool (in this case Thermo-Calc) as the prior information for the parameter values and ranges in the sampling process is often necessary to achieve a reasonable cost for uncertainty quantification. This brings up the problem of finding an appropriate CALPHAD model with a high level of confidence, which is a very hard and costly task that requires considerable expert skill. A Bayesian hypothesis testing based on Bayes' factors is proposed to fulfill the need of model selection in this case, which is applied to compare four recommended models for the Hf-Si system. However, it is demonstrated that information fusion approaches, i.e., Bayesian model averaging and an error correlation-based model fusion, can be used to combine the useful information existing in all the given models rather than just using the best selected model, which may lack some information about the system being modelled.

Submitted 18 July, 2018; v1 submitted 12 June, 2018; originally announced June 2018.

Comments: 22 pages, 8 Figures

arXiv:1602.08807 [pdf, other]

doi 10.1080/10485252.2018.1537442

Exploratory data analysis for moderate extreme values using non-parametric kernel methods

Authors: Boris Beranger, Tarn Duong, Sarah E. Perkins-Kirkpatrick, Scott A. Sisson

Abstract: In many settings it is critical to accurately model the extreme tail behaviour of a random process. Non-parametric density estimation methods are commonly implemented as exploratory data analysis techniques for this purpose as they possess excellent visualisation properties, and can naturally avoid the model specification biases implied by using parametric estimators. In particular, kernel-based estimators place minimal assumptions on the data, and provide improved visualisation over scatterplots and histograms. However kernel density estimators are known to perform poorly when estimating extreme tail behaviour, which is important when interest is in process behaviour above some large threshold, and they can over-emphasise bumps in the density for heavy tailed data. In this article we develop a transformation kernel density estimator, and demonstrate that its mean integrated squared error (MISE) efficiency is equivalent to that of standard, non-tail focused kernel density estimators. Estimator performance is illustrated in numerical studies, and in an expanded analysis of the ability of well known global climate models to reproduce observed temperature extremes in Sydney, Australia.

Submitted 6 December, 2017; v1 submitted 28 February, 2016; originally announced February 2016.

arXiv:1310.2559 [pdf, other]

doi 10.1007/s11222-014-9465-1

Efficient recursive algorithms for functionals based on higher order derivatives of the multivariate Gaussian density

Authors: José E. Chacón, Tarn Duong

Abstract: Many developments in Mathematics involve the computation of higher order derivatives of Gaussian density functions. The analysis of univariate Gaussian random variables is a well-established field whereas the analysis of their multivariate counterparts consists of a body of results which are more dispersed. These latter results generally fall into two main categories: theoretical expressions which reveal the deep structure of the problem, or computational algorithms which can mask the connections with closely related problems. In this paper, we unify existing results and develop new results in a framework which is both conceptually cogent and computationally efficient. We focus on the underlying connections between higher order derivatives of Gaussian density functions, the expected value of products of quadratic forms in Gaussian random variables, and V-statistics of degree two based on Gaussian density functions. These three sets of results are combined into an analysis of non-parametric data smoothers.

Submitted 23 March, 2014; v1 submitted 9 October, 2013; originally announced October 2013.

Comments: 30 pages, 1 figure

MSC Class: 15A24; 65F30; 62E10; 62G05; 62H05

arXiv:1305.7344 [pdf, other]

doi 10.1371/journal.pone.0100334

Joint Modeling and Registration of Cell Populations in Cohorts of High-Dimensional Flow Cytometric Data

Authors: Saumyadipta Pyne, Kui Wang, Jonathan Irish, Pablo Tamayo, Marc-Danie Nazaire, Tarn Duong, Sharon Lee, Shu-Kay Ng, David Hafler, Ronald Levy, Garry Nolan, Jill Mesirov, Geoffrey J. McLachlan

Abstract: In systems biomedicine, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multi-variable network-level responses. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, without the ability to automatically match them across samples, it is difficult to compare and characterize the populations in typical experiments, such as those responding to various stimulations or distinctive of particular patients or time-points, especially when there are many samples. Joint Clustering and Matching (JCM) is a multi-level framework for simultaneous modeling and registration of populations across a cohort. JCM models every population with a robust multivariate probability distribution. Simultaneously, JCM fits a random-effects model to construct an overall batch template, used for registering populations across samples, and classifying new samples. By tackling systems-level variation, JCM supports practical biomedical applications involving large cohorts.

Submitted 31 May, 2013; originally announced May 2013.

arXiv:1204.6160 [pdf, other]

doi 10.1214/13-EJS781

Data-driven density derivative estimation, with applications to nonparametric clustering and bump hunting

Authors: José E. Chacón, Tarn Duong

Abstract: Important information concerning a multivariate data set, such as clusters and modal regions, is contained in the derivatives of the probability density function. Despite this importance, nonparametric estimation of higher order derivatives of the density functions has received only relatively scant attention. Kernel estimators of density functions are widely used as they exhibit excellent theoretical and practical properties, though their generalization to density derivatives has progressed more slowly due to the mathematical intractabilities encountered in the crucial problem of bandwidth (or smoothing parameter) selection. This paper presents the first fully automatic, data-based bandwidth selectors for multivariate kernel density derivative estimators. This is achieved by synthesizing recent advances in matrix analytic theory which allow mathematically and computationally tractable representations of higher order derivatives of multivariate vector valued functions. The theoretical asymptotic properties as well as the finite sample behaviour of the proposed selectors are studied. In addition, we explore in detail the applications of the new data-driven methods for two other statistical problems: clustering and bump hunting. The introduced techniques are combined with the mean shift algorithm to develop novel automatic, nonparametric clustering procedures which are shown to outperform mixture-model cluster analysis and other recent nonparametric approaches in practice. Furthermore, the advantage of the use of smoothing parameters designed for density derivative estimation for feature significance analysis for bump hunting is illustrated with a real data example.

Submitted 19 February, 2013; v1 submitted 27 April, 2012; originally announced April 2012.

Comments: 36 pages, 5 figures

MSC Class: 62G05; 62H30

Showing 1–17 of 17 results for author: Duong, T