On minimizing the training set fill distance in machine learning regression

Climaco, Paolo; Garcke, Jochen


arXiv:2307.10988 (cs)

[Submitted on 20 Jul 2023 (v1), last revised 5 Dec 2023 (this version, v2)]


Abstract: Regression tasks often rely on large datasets to train predictive machine learning models. However, using large datasets may be infeasible because of computational limitations or high data labelling costs. Selecting small training sets from large pools of unlabelled data points is therefore essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional on the locations of the unlabelled data points, that depends linearly on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We show empirically that selecting a training set with the aim of minimizing the fill distance, and thereby our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with FPS can also increase model stability in the specific case of Gaussian kernel regression approaches.
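To make the two central notions concrete, the following is a minimal sketch of greedy farthest point sampling and of the fill distance, not the authors' implementation. It assumes the Euclidean norm, a pool stored as a NumPy array with one point per row, and an arbitrary starting index; the function names `farthest_point_sampling` and `fill_distance` are hypothetical. Here the fill distance of a selected subset is taken to be the largest distance from any pool point to its nearest selected point.

```python
import numpy as np


def farthest_point_sampling(X, k, first=0):
    """Greedily pick k row indices of X (assumed sketch, Euclidean norm).

    Each new index is the pool point farthest from the points chosen so far.
    `first` is an assumed, arbitrary starting index.
    """
    X = np.asarray(X, dtype=float)
    selected = [first]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


def fill_distance(X, selected):
    """Largest distance from any point in X to its nearest selected point."""
    X = np.asarray(X, dtype=float)
    S = X[selected]
    # Pairwise distances between all pool points and the selected points
    d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)
    return float(d.min(axis=1).max())


# Illustrative comparison on synthetic data (not one of the paper's datasets)
rng = np.random.default_rng(0)
pool = rng.uniform(size=(500, 2))
fps_idx = farthest_point_sampling(pool, 20)
rand_idx = rng.choice(len(pool), size=20, replace=False)
print("FPS fill distance:   ", fill_distance(pool, fps_idx))
print("random fill distance:", fill_distance(pool, rand_idx))
```

The greedy loop keeps a running array of nearest-selected distances, so each step costs one pass over the pool instead of recomputing all pairwise distances. The final comparison against a uniformly random subset is only an illustration of how the two selections could be contrasted.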

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2307.10988 [cs.LG] (or arXiv:2307.10988v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2307.10988

Submission history

From: Paolo Climaco
[v1] Thu, 20 Jul 2023 16:18:33 UTC (3,086 KB)
[v2] Tue, 5 Dec 2023 13:23:55 UTC (781 KB)
