Estimation of conditional average treatment effects on distributed data: A privacy-preserving approach

Authors: Yuji Kawamata, Ryoki Motai, Yukihiko Okada, Akira Imakura, Tetsuya Sakurai

Abstract: Estimation of conditional average treatment effects (CATEs) is an important topic in the sciences. CATEs can be estimated with high accuracy if distributed data across multiple parties can be centralized. However, it is difficult to aggregate such data owing to privacy concerns. To address this issue, we proposed data collaboration double machine learning, a method that can estimate CATE models with privacy preservation of distributed data, and evaluated the method through simulations. Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric CATE models enable estimation and testing that is more robust to model mis-specification than parametric models. Second, our method enables collaborative estimation between multiple time points and different parties. Third, our method performed as well as or better than other methods in simulations using synthetic, semi-synthetic and real-world datasets.

Submitted 25 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: 51 pages, 11 figures

arXiv:2311.03225

Dichotomies for Tree Minor Containment with Structural Parameters

Authors: Tatsuya Gima, Soh Kumabe, Kazuhiro Kurita, Yuto Okada, Yota Otachi

Abstract: The problem of determining whether a graph $G$ contains another graph $H$ as a minor, referred to as the minor containment problem, is a fundamental problem in the field of graph algorithms. While it is NP-complete when $G$ and $H$ are general graphs, it is sometimes tractable on more restricted graph classes. This study focuses on the case where both $G$ and $H$ are trees, known as the tree minor containment problem. Even in this case, the problem is known to be NP-complete. In contrast, polynomial-time algorithms are known for the case when both trees are caterpillars or when the maximum degree of $H$ is a constant. Our research aims to clarify the boundary of tractability and intractability for the tree minor containment problem. Specifically, we provide dichotomies for the computational complexities of the problem based on three structural parameters: the diameter, pathwidth, and path eccentricity.

Submitted 6 November, 2023; originally announced November 2023.

Comments: 25 pages, 4 figures, WALCOM 2024

arXiv:2210.02835

Sequentially Swapping Tokens: Further on Graph Classes

Authors: Hironori Kiya, Yuto Okada, Hirotaka Ono, Yota Otachi

Abstract: We study the following variant of the 15 puzzle. Given a graph and two token placements on the vertices, we want to find a walk of the minimum length (if any exists) such that the sequence of token swappings along the walk obtains one of the given token placements from the other one. This problem was introduced as Sequential Token Swapping by Yamanaka et al. [JGAA 2019], who showed that the problem is intractable in general but polynomial-time solvable for trees, complete graphs, and cycles. In this paper, we present a polynomial-time algorithm for block-cactus graphs, which include all previously known cases. We also present general tools for showing the hardness of the problem on restricted graph classes such as chordal graphs and chordal bipartite graphs. We also show that the problem is hard on grids and king's graphs, which are the graphs corresponding to the 15 puzzle and its variant with relaxed moves.

Submitted 9 March, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: 24 pages, 15 figures, SOFSEM 2023

arXiv:2208.14611

Non-readily identifiable data collaboration analysis for multiple datasets including personal information

Authors: Akira Imakura, Tetsuya Sakurai, Yukihiko Okada, Tomoya Fujii, Teppei Sakamoto, Hiroyuki Abe

Abstract: Multi-source data fusion, in which multiple data sources are jointly analyzed to obtain improved information, has attracted considerable research attention. For the datasets of multiple medical institutions, data confidentiality and cross-institutional communication are critical. In such cases, data collaboration (DC) analysis by sharing dimensionality-reduced intermediate representations without iterative cross-institutional communications may be appropriate. Identifiability of the shared data is essential when analyzing data including personal information. In this study, the identifiability of the DC analysis is investigated. The results reveal that the shared intermediate representations are readily identifiable to the original data for supervised learning. This study then proposes a non-readily identifiable DC analysis that shares only non-readily identifiable data for multiple medical datasets including personal information. The proposed method solves identifiability concerns based on a random sample permutation, the concept of interpretable DC analysis, and usage of functions that cannot be reconstructed. In numerical experiments on medical datasets, the proposed method is non-readily identifiable while maintaining the high recognition performance of the conventional DC analysis. For a hospital dataset, the proposed method exhibits a nine percentage point improvement in recognition performance over the local analysis that uses only the local dataset.

Submitted 30 August, 2022; originally announced August 2022.

Comments: 19 pages, 3 figures, 7 tables

arXiv:2208.12458

Another Use of SMOTE for Interpretable Data Collaboration Analysis

Authors: Akira Imakura, Masateru Kihira, Yukihiko Okada, Tetsuya Sakurai

Abstract: Recently, data collaboration (DC) analysis has been developed for privacy-preserving integrated analysis across multiple institutions. DC analysis centralizes individually constructed dimensionality-reduced intermediate representations and realizes integrated analysis via collaboration representations without sharing the original data. To construct the collaboration representations, each institution generates and shares a shareable anchor dataset and centralizes its intermediate representation. Although a random anchor dataset functions well for DC analysis in general, using an anchor dataset whose distribution is close to that of the raw dataset is expected to improve the recognition performance, particularly for the interpretable DC analysis. Based on an extension of the synthetic minority over-sampling technique (SMOTE), this study proposes an anchor data construction technique to improve the recognition performance without increasing the risk of data leakage. Numerical results demonstrate the efficiency of the proposed SMOTE-based method over the existing anchor data constructions for artificial and real-world datasets. Specifically, the proposed method achieves 9 percentage point and 38 percentage point performance improvements regarding accuracy and essential feature selection, respectively, over existing methods for an income dataset. The proposed method provides another use of SMOTE, not for imbalanced data classification but as a key technology for privacy-preserving integrated analysis.

Submitted 26 August, 2022; originally announced August 2022.

Comments: 19 pages, 3 figures, 7 tables

arXiv:2208.07898

doi 10.1016/j.eswa.2023.123024

Collaborative causal inference on distributed data

Authors: Yuji Kawamata, Ryoki Motai, Yukihiko Okada, Akira Imakura, Tetsuya Sakurai

Abstract: In recent years, the development of technologies for causal inference with privacy preservation of distributed data has gained considerable attention. Many existing methods for distributed data focus on resolving the lack of subjects (samples) and can only reduce random errors in estimating treatment effects. In this study, we propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing random errors and biases in the estimation. Our method involves constructing dimensionality-reduced intermediate representations from private data from local parties, sharing intermediate representations instead of private data for privacy preservation, estimating propensity scores from the shared intermediate representations, and finally, estimating the treatment effects from propensity scores. Through numerical experiments on both artificial and real-world data, we confirm that our method leads to better estimation results than individual analyses. While dimensionality reduction loses some information in the private data and causes performance degradation, we observe that sharing intermediate representations with many parties to resolve the lack of subjects and covariates sufficiently improves performance to overcome the degradation caused by dimensionality reduction. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. With the widespread use of our method, intermediate representations can be published as open data to help researchers find causalities and accumulate a knowledge base.

Submitted 11 January, 2024; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: 16 pages, 4 figures

Journal ref: Expert Systems with Applications, 123024 (2023)

arXiv:2205.08664

doi 10.1145/3531348.3532177

Journey of Migrating Millions of Queries on The Cloud

Authors: Taro L. Saito, Naoki Takezoe, Yukihiro Okada, Takako Shimamoto, Dongmin Yu, Suprith Chandrashekharachar, Kai Sasaki, Shohei Okumiya, Yan Wang, Takashi Kurihara, Ryu Kobayashi, Keisuke Suzuki, Zhenghong Yang, Makoto Onizuka

Abstract: Treasure Data is processing millions of distributed SQL queries every day on the cloud. Upgrading the query engine service at this scale is challenging because we need to migrate all of the production queries of the customers to a new version while preserving the correctness and performance of the data processing pipelines. To ensure the quality of the query engines, we utilize our query logs to build customer-specific benchmarks and replay these queries with real customer data in a secure pre-production environment. To simulate millions of queries, we need effective minimization of test query sets and better reporting of the simulation results to proactively find incompatible changes and performance regression of the new version. This paper describes the overall design of our system and shares various challenges in maintaining the quality of the query engine service on the cloud.

Submitted 17 May, 2022; originally announced May 2022.

Comments: This version is published in DBTest '22: Proceedings of the 2022 workshop on 9th International Workshop of Testing Database Systems

MSC Class: 68P20

ACM Class: H.2.4; D.2.9

arXiv:2011.06803

Federated Learning System without Model Sharing through Integration of Dimensional Reduced Data Representations

Authors: Anna Bogdanova, Akie Nakai, Yukihiko Okada, Akira Imakura, Tetsuya Sakurai

Abstract: Dimensionality reduction is a commonly used element in a machine learning pipeline that helps to extract important features from high-dimensional data. In this work, we explore an alternative federated learning system that enables integration of dimensionality-reduced representations of distributed data prior to a supervised learning task, thus avoiding model sharing among the parties. We compare the performance of this approach on image classification tasks to three alternative frameworks: centralized machine learning, individual machine learning, and Federated Averaging, and analyze potential use cases for a federated learning system without model sharing. Our results show that our approach can achieve accuracy similar to that of Federated Averaging and performs better than Federated Averaging in a small-user setting.

Submitted 13 November, 2020; originally announced November 2020.

Comments: 6 pages with 4 figures. To be presented at the Workshop on Federated Learning for Data Privacy and Confidentiality in Conjunction with IJCAI 2020 (FL-IJCAI'20)

arXiv:2011.04437

Interpretable collaborative data analysis on distributed data

Authors: Akira Imakura, Hiroaki Inaba, Yukihiko Okada, Tetsuya Sakurai

Abstract: This paper proposes an interpretable non-model sharing collaborative data analysis method as one of the federated learning systems, which is an emerging technology to analyze distributed data. Analyzing distributed data is essential in many applications such as medical, financial, and manufacturing data analyses due to privacy and confidentiality concerns. In addition, interpretability of the obtained model has an important role for practical applications of the federated learning systems. By centralizing intermediate representations, which are individually constructed in each party, the proposed method obtains an interpretable model, achieving a collaborative analysis without revealing the individual data and learning model distributed over local parties. Numerical experiments indicate that the proposed method achieves better recognition performance for artificial and real-world problems than individual analysis.

Submitted 9 November, 2020; originally announced November 2020.

Comments: 16 pages, 3 figures, 3 tables

arXiv:2003.05127

Hundred Drones Land in a Minute

Authors: Daiki Fujikura, Kenjiro Tadakuma, Masahiro Watanabe, Yoshito Okada, Kazunori Ohno, Satoshi Tadokoro

Abstract: Currently, drone research and development has received significant attention worldwide. Particularly, delivery services employ drones as it is a viable method to improve delivery efficiency by using several unmanned drones. Research has been conducted to realize complete automation of drone control for such services. However, regarding the takeoff and landing port of the drones, conventional methods have focused on the landing operation of a single drone, and the continuous landing of multiple drones has not been realized. To address this issue, we propose a completely novel port system, "EAGLES Port," that allows several drones to continuously land and take off in a short time. Experiments verified that the landing time efficiency of the proposed port is ideally 7.5 times higher than that of conventional vertical landing systems. Moreover, the system can tolerate 270 mm of horizontal positional error, ±30 deg of angular error in the drone's approach (±40 deg with the proposed gate mechanism), and up to 1.9 m/s of drone approach speed. This technology significantly contributes to the scalability of drone usage. Therefore, it is critical for the development of a future drone port for the landing of automated drone swarms.

Submitted 11 March, 2020; originally announced March 2020.

Comments: 8 pages, 12 figures

Showing 1–10 of 10 results for author: Okada, Y