-
Efficient Model Selection for Predictive Pattern Mining Model by Safe Pattern Pruning
Authors:
Takumi Yoshida,
Hiroyuki Hanada,
Kazuya Nakagawa,
Kouichi Taji,
Koji Tsuda,
Ichiro Takeuchi
Abstract:
Predictive pattern mining is an approach used to construct prediction models when the input is represented by structured data, such as sets, graphs, and sequences. The main idea behind predictive pattern mining is to build a prediction model by considering substructures, such as subsets, subgraphs, and subsequences (referred to as patterns), present in the structured data as features of the model. The primary challenge in predictive pattern mining lies in the exponential growth of the number of patterns with the complexity of the structured data. In this study, we propose the Safe Pattern Pruning (SPP) method to address the explosion of pattern numbers in predictive pattern mining. We also discuss how it can be effectively employed throughout the entire model building process in practical data analysis. To demonstrate the effectiveness of the proposed method, we conduct numerical experiments on regression and classification problems involving sets, graphs, and sequences.
Submitted 23 June, 2023;
originally announced June 2023.
-
A linearization for stable and fast geographically weighted Poisson regression
Authors:
Daisuke Murakami,
Narumasa Tsutsumida,
Takahiro Yoshida,
Tomoki Nakaya,
Binbin Lu,
Paul Harris
Abstract:
Although geographically weighted Poisson regression (GWPR) is a popular regression for spatially indexed count data, its development is relatively limited compared to that found for linear geographically weighted regression (GWR), where many extensions (e.g., multiscale GWR, scalable GWR) have been proposed. The weak development of GWPR can be attributed to the computational cost and identification problem in the underpinning Poisson regression model. This study proposes linearized GWPR (L-GWPR) by introducing a log-linear approximation into the GWPR model to overcome these bottlenecks. Because the L-GWPR model is identical to the Gaussian GWR model, it is free from the identification problem, easily implemented, computationally efficient, and offers similar potential for extension. Specifically, L-GWPR does not require a double-loop algorithm, which makes GWPR slow for large samples. Furthermore, we extended L-GWPR by introducing ridge regularization to enhance its stability (regularized L-GWPR). The results of the Monte Carlo experiments confirmed that regularized L-GWPR estimates local coefficients accurately and computationally efficiently. Finally, we compared GWPR and regularized L-GWPR through a crime analysis in Tokyo.
Submitted 15 May, 2023;
originally announced May 2023.
-
Unit-level mixed effects models for conditional extremes
Authors:
Koki Momoki,
Takuma Yoshida
Abstract:
Extreme value theory (EVT) provides an elegant mathematical tool for statistically analyzing rare events. When data are collected from multiple population subgroups, researchers generally aim to improve the estimates obtained directly from each subgroup, because some subgroups may have less data available for extreme value analysis. To achieve this, we incorporate the mixed effects model (MEM) into the regression technique in EVT. In small area estimation, the MEM has attracted considerable attention as a primary tool for providing reliable estimates for subgroups with small sample sizes, that is, "small area." The key idea of the MEM is to incorporate information from all subgroups into a single model and to borrow strength from other subgroups to improve estimates by subgroup. Using this property, the MEM may contribute to reducing the bias and variance of the direct estimates for each subgroup, which result from the asymptotic specification of EVT. This prompts us to evaluate MEM's effectiveness in EVT through theoretical studies and numerical experiments, including its application to assessing the risk of heavy rainfall in Japan.
Submitted 30 September, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
Bayesian Active Questionnaire Design for Cause-of-Death Assignment Using Verbal Autopsies
Authors:
Toshiya Yoshida,
Trinity Shuxian Fan,
Tyler McCormick,
Zhenke Wu,
Zehang Richard Li
Abstract:
Only about one-third of the deaths worldwide are assigned a medically-certified cause, and understanding the causes of deaths occurring outside of medical facilities is logistically and financially challenging. Verbal autopsy (VA) is a routinely used tool to collect information on cause of death in such settings. VA is a survey-based method where a structured questionnaire is administered to family members or caregivers of a recently deceased person, and the collected information is used to infer the cause of death. As VA becomes an increasingly routine tool for cause-of-death data collection, the lengthy questionnaire has become a major challenge to the implementation and scale-up of VAs. In this paper, we propose a novel active questionnaire design approach that optimizes the order of the questions dynamically to achieve accurate cause-of-death assignment with the smallest number of questions. We propose a fully Bayesian strategy for adaptive question selection that is compatible with any existing probabilistic cause-of-death assignment methods. We also develop an early stopping criterion that fully accounts for the uncertainty in the model parameters. We also propose a penalized score to account for constraints and preferences of existing question structures. We evaluate the performance of our active designs using both synthetic and real data, demonstrating that the proposed strategy achieves accurate cause-of-death assignment using considerably fewer questions than the traditional static VA survey instruments.
Submitted 27 April, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Spatial prediction of apartment rent using regression-based and machine learning-based approaches with a large dataset
Authors:
Takahiro Yoshida,
Hajime Seya
Abstract:
Employing a large dataset (at most, the order of n = 10^6), this study attempts to enhance the literature on the comparison between regression and machine learning (ML)-based rent price prediction models by adding new empirical evidence and considering the spatial dependence of the observations. The regression-based approach incorporates the nearest neighbor Gaussian processes (NNGP) model, enabling the application of kriging to large datasets. In contrast, the ML-based approach utilizes typical models: extreme gradient boosting (XGBoost), random forest (RF), and deep neural network (DNN). The out-of-sample prediction accuracy of these models was compared using Japanese apartment rent data, with a varying order of sample sizes (i.e., n = 10^4, 10^5, 10^6). The results showed that, as the sample size increased, XGBoost and RF outperformed NNGP with higher out-of-sample prediction accuracy. XGBoost achieved the highest prediction accuracy for all sample sizes and error measures in both logarithmic and real scales and for all price bands (when n = 10^5 and 10^6). A comparison of several methods to account for the spatial dependence in RF showed that simply adding spatial coordinates to the explanatory variables may be sufficient.
Submitted 26 July, 2021;
originally announced July 2021.
-
gwpcorMapper: an interactive mapping tool for exploring geographically weighted correlation and partial correlation in high-dimensional geospatial datasets
Authors:
Joseph Emile Honour Percival,
Narumasa Tsutsumida,
Daisuke Murakami,
Takahiro Yoshida,
Tomoki Nakaya
Abstract:
Exploratory spatial data analysis (ESDA) plays a key role in research that includes geographic data. In ESDA, analysts often want to be able to visualize observations and local relationships on a map. However, software dedicated to visualizing local spatial relations between multiple variables in high dimensional datasets remains undeveloped. This paper introduces gwpcorMapper, a newly developed software application for mapping geographically weighted correlation and partial correlation in large multivariate datasets. gwpcorMapper facilitates ESDA by giving researchers the ability to interact with map components that describe local correlative relationships. We built gwpcorMapper using the R Shiny framework. The software inherits its core algorithm from GWpcor, an R library for calculating the geographically weighted correlation and partial correlation statistics. We demonstrate the application of gwpcorMapper by using it to explore census data in order to find meaningful relationships that describe the work-life environment in the 23 special wards of Tokyo, Japan. We show that gwpcorMapper is useful in both variable selection and parameter tuning for geographically weighted statistics. gwpcorMapper highlights that there are strong statistically clear local variations in the relationship between the number of commuters and the total number of hours worked when considering the total population in each district across the 23 special wards of Tokyo. Our application demonstrates that the ESDA process with high-dimensional geospatial data using gwpcorMapper has applications across multiple fields.
Submitted 8 May, 2022; v1 submitted 10 January, 2021;
originally announced January 2021.
-
Gaussian Hierarchical Latent Dirichlet Allocation: Bringing Polysemy Back
Authors:
Takahiro Yoshida,
Ryohei Hisano,
Takaaki Ohnishi
Abstract:
Topic models are widely used to discover the latent representation of a set of documents. The two canonical models are latent Dirichlet allocation and Gaussian latent Dirichlet allocation, where the former uses multinomial distributions over words, and the latter uses multivariate Gaussian distributions over pre-trained word embedding vectors as the latent topic representations. Compared with latent Dirichlet allocation, Gaussian latent Dirichlet allocation is limited in the sense that it does not capture the polysemy of a word such as "bank." In this paper, we show that Gaussian latent Dirichlet allocation could recover the ability to capture polysemy by introducing a hierarchical structure in the set of topics that the model can use to represent a given document. Our Gaussian hierarchical latent Dirichlet allocation significantly improves polysemy detection compared with Gaussian-based models and provides more parsimonious topic representations compared with hierarchical latent Dirichlet allocation. Our extensive quantitative experiments show that our model also achieves better topic coherence and held-out document predictive accuracy over a wide range of corpora and word embedding vectors.
Submitted 7 June, 2023; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Distance Metric Learning for Graph Structured Data
Authors:
Tomoki Yoshida,
Ichiro Takeuchi,
Masayuki Karasuyama
Abstract:
Graphs are versatile tools for representing structured data. As a result, a variety of machine learning methods have been studied for graph data analysis. Although many such learning methods depend on the measurement of differences between input graphs, defining an appropriate distance metric for graphs remains a controversial issue. Hence, we propose a supervised distance metric learning method for the graph classification problem. Our method, named interpretable graph metric learning (IGML), learns discriminative metrics in a subgraph-based feature space, which has a strong graph representation capability. By introducing a sparsity-inducing penalty on the weight of each subgraph, IGML can identify a small number of important subgraphs that can provide insight into the given classification task. Because our formulation has a large number of optimization variables, an efficient algorithm that uses pruning techniques based on safe screening and working set selection methods is also proposed. An important property of IGML is that solution optimality is guaranteed because the problem is formulated as a convex problem and our pruning strategies only discard unnecessary subgraphs. Furthermore, we show that IGML is also applicable to other structured data such as itemset and sequence data, and that it can incorporate vertex-label similarity by using a transportation-based subgraph feature. We empirically evaluate the computational efficiency and classification performance of IGML on several benchmark datasets and provide some illustrative examples of how IGML identifies important subgraphs from a given graph dataset.
Submitted 17 June, 2021; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Scalable GWR: A linear-time algorithm for large-scale geographically weighted regression with polynomial kernels
Authors:
Daisuke Murakami,
Narumasa Tsutsumida,
Takahiro Yoshida,
Tomoki Nakaya,
Binbin Lu
Abstract:
Although a number of studies have developed fast geographically weighted regression (GWR) algorithms for large samples, none of them has achieved linear-time estimation, which is considered a requisite for big data analysis in machine learning, geostatistics, and related domains. Against this backdrop, this study proposes a scalable GWR (ScaGWR) for large datasets. The key improvement is the calibration of the model through a pre-compression of the matrices and vectors whose size depends on the sample size, prior to the leave-one-out cross-validation, which is the heaviest computational step in conventional GWR. This pre-compression allows us to run the proposed GWR extension so that its computation time increases linearly with the sample size. With this improvement, the ScaGWR can be calibrated with one million observations without parallelization. Moreover, the ScaGWR estimator can be regarded as an empirical Bayesian estimator that is more stable than the conventional GWR estimator. We compare the ScaGWR with the conventional GWR in terms of estimation accuracy and computational efficiency using a Monte Carlo simulation. Then, we apply these methods to a US income analysis. The code for ScaGWR is available in the R package scgwr. The code is embedded into C++ code and implemented in another R package, GWmodel.
Submitted 23 April, 2020; v1 submitted 1 May, 2019;
originally announced May 2019.
-
Which country epitomizes the world? A study from the perspective of demographic composition
Authors:
Takahiro Yoshida,
Rim Er-Rbib,
Morito Tsutsumi
Abstract:
Demographic indicators are an essential element in considering various problems in the social economy, such as predicting economic fluctuations and establishing policies. The literature widely discusses the growth of the world population or issues pertaining to its aging, but has given little to no attention to population structures and transition patterns. In this article, we take advantage of the characteristics of compositional data to examine the transition of the world population structure. Using the Aitchison distance, we examine the similarity of the world population structure from the 1990s to 2080 and that of countries and regions in 2015 and create maps to illustrate the results. Accordingly, we identify the following countries and regions as epitomes of the world population structure through different periods: India, Northern Africa, and South Africa in the 1990s; South America in 2015 to 2030; Oceania and Northern America in 2040; Uruguay and Puerto Rico in 2050 to 2060; and Italy and Japan in the distant future. We then cluster countries based on the similarity of their population structures in 2015 and match each cluster to a certain period. We found that Russia and Western Europe gather in a cluster that does not correspond to any period, indicating a recessive population structure.
Submitted 29 September, 2018;
originally announced October 2018.
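For readers unfamiliar with the Aitchison distance named in the abstract above, the following is a minimal sketch of its standard definition: the Euclidean distance between centered log-ratio (clr) transforms of two compositions. It is not taken from the paper; the function names `clr` and `aitchison_distance` and the example age-share vectors are hypothetical, and all parts are assumed to be strictly positive.

```python
import math


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    if len(x) != len(y):
        raise ValueError("compositions must have the same number of parts")
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


# Hypothetical age-group shares (young, working-age, elderly), for illustration only.
population_a = [0.30, 0.60, 0.10]
population_b = [0.15, 0.60, 0.25]
distance = aitchison_distance(population_a, population_b)
```

Because the clr transform subtracts the mean log, the distance is unchanged when either composition is rescaled, so shares and raw counts give the same result.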
-
Safe Triplet Screening for Distance Metric Learning
Authors:
Tomoki Yoshida,
Ichiro Takeuchi,
Masayuki Karasuyama
Abstract:
We study safe screening for metric learning. Distance metric learning can optimize a metric over a set of triplets, each of which is defined by a pair of same-class instances and an instance in a different class. However, the number of possible triplets is huge even for a small dataset. Our safe triplet screening identifies triplets that can be safely removed from the optimization problem without losing optimality. Compared with existing safe screening studies, triplet screening is particularly significant because of (1) the huge number of possible triplets, and (2) the semi-definite constraint in the optimization. We derive several variants of screening rules and analyze their relationships. Numerical experiments on benchmark datasets demonstrate the effectiveness of safe triplet screening.
Submitted 5 October, 2018; v1 submitted 12 February, 2018;
originally announced February 2018.
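To make the growth in the number of triplets concrete, here is a small sketch that counts them from class labels alone. It assumes a triplet is an unordered pair of same-class instances plus one instance from any other class; the paper may use a different convention (for example, ordered anchor and target pairs), and the function name `count_triplets` is hypothetical.

```python
from collections import Counter
from math import comb


def count_triplets(labels):
    """Count triplets: an unordered same-class pair plus one instance of another class."""
    n = len(labels)
    class_sizes = Counter(labels).values()
    # For each class: choose 2 of its members, then any instance outside the class.
    return sum(comb(size, 2) * (n - size) for size in class_sizes)


# Three balanced classes of 50 instances each (150 instances in total).
labels = [0] * 50 + [1] * 50 + [2] * 50
total = count_triplets(labels)  # 3 * C(50, 2) * 100 = 367500
```

Under this counting convention, 150 instances already yield 367,500 candidate triplets, and the count grows roughly cubically with the dataset size.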
-
A Moran coefficient-based mixed effects approach to investigate spatially varying relationships
Authors:
Daisuke Murakami,
Takahiro Yoshida,
Hajime Seya,
Daniel A. Griffith,
Yoshiki Yamagata
Abstract:
This study develops a spatially varying coefficient model by extending the random effects eigenvector spatial filtering model. The developed model has the following properties: its coefficients are interpretable in terms of the Moran coefficient; each of its coefficients can have a different degree of spatial smoothness; and it yields a variant of a Bayesian spatially varying coefficient model. Also, parameter estimation of the model can be executed with a relatively small computational burden. Results of a Monte Carlo simulation reveal that our model outperforms a conventional eigenvector spatial filtering (ESF) model and geographically weighted regression (GWR) models in terms of the accuracy of the coefficient estimates and computational time. We empirically apply our model to the hedonic land price analysis of flood risk in Japan.
Submitted 10 August, 2016; v1 submitted 22 June, 2016;
originally announced June 2016.