-
Feature space reduction method for ultrahigh-dimensional, multiclass data: Random forest-based multiround screening (RFMS)
Authors:
Gergely Hanczár,
Marcell Stippinger,
Dávid Hanák,
Marcell T. Kurbucz,
Olivér M. Törteli,
Ágnes Chripkó,
Zoltán Somogyvári
Abstract:
In recent years, numerous screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features; however, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data give rise to this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, RFMS is on par with industry-standard feature screening methods while simultaneously possessing many advantages over these methods.
Submitted 25 May, 2023;
originally announced May 2023.
-
BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space
Authors:
Marcell Stippinger,
Dávid Hanák,
Marcell T. Kurbucz,
Gergely Hanczár,
Olivér M. Törteli,
Zoltán Somogyvári
Abstract:
The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the overall usefulness and the intercorrelations of blended features can be controlled by the user, so the synthetic feature space is able to imitate the key properties of a real biometric dataset.
Submitted 25 April, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Reconstructing shared dynamics with a deep neural network
Authors:
Zsigmond Benkő,
Zoltán Somogyvári
Abstract:
Determining hidden shared patterns behind dynamic phenomena can be a game-changer in multiple areas of research. Here we present the principles and show a method to identify hidden shared dynamics from time series by a two-module, feedforward neural network architecture: the Mapper-Coach network. We reconstruct an unobserved, continuous latent variable input, the time series generated by a chaotic logistic map, from the observed values of two simultaneously forced chaotic logistic maps. The network was trained by error backpropagation to predict one of the observed time series based on its own past and conditioned on the other observed time series. We show that, after this prediction had been learned successfully, the activity of the bottleneck neuron connecting the mapper and the coach modules correlated strongly with the latent shared input variable. The method has the potential to reveal hidden components of dynamical systems where experimental intervention is not possible.
Submitted 14 October, 2022; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Manifold-adaptive dimension estimation revisited
Authors:
Zsigmond Benkő,
Marcell Stippinger,
Roberta Rehus,
Attila Bencze,
Dániel Fabó,
Boglárka Hajnal,
Loránd Erőss,
András Telcs,
Zoltán Somogyvári
Abstract:
Data dimensionality informs us about data complexity and sets limits on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold-adaptive Farahmand-Szepesvári-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates under the assumption that the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality, assuming the local estimates are i.i.d. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected-median-FSA estimator with kNN estimators: maximum likelihood (ML, Levina-Bickel) and two implementations of DANCo (R and MATLAB). We show that the corrected-median-FSA estimator beats the ML estimator and is on equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics during resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.
Submitted 10 August, 2020; v1 submitted 7 August, 2020;
originally announced August 2020.
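The median-FSA idea from the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly quoted local FSA formula d_i = ln 2 / ln(R_2k(i) / R_k(i)), where R_k(i) is the distance from point i to its k-th nearest neighbor, uses a brute-force neighbor search, and omits the exponential finite-sample correction mentioned in the abstract. The function names and the default k are hypothetical.

```python
import math
import random
from statistics import median


def local_fsa_estimates(points, k=5):
    """Local FSA estimates d_i = ln 2 / ln(R_2k(i) / R_k(i)).

    Assumption: this is the uncorrected local formula; the paper's
    exponential correction for edge effects is NOT applied here.
    """
    if len(points) <= 2 * k:
        raise ValueError("need more than 2*k points")
    estimates = []
    for i, p in enumerate(points):
        # Brute-force sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_k, r_2k = dists[k - 1], dists[2 * k - 1]
        # Skip degenerate neighborhoods (duplicate points or tied radii).
        if r_k > 0 and r_2k > r_k:
            estimates.append(math.log(2) / math.log(r_2k / r_k))
    return estimates


def median_fsa(points, k=5):
    """Global intrinsic dimension as the median of the local estimates."""
    return median(local_fsa_estimates(points, k))


if __name__ == "__main__":
    random.seed(0)
    # A 2-dimensional plane embedded in 3-dimensional space.
    plane = [(x, y, x + y) for x, y in
             ((random.random(), random.random()) for _ in range(300))]
    print(median_fsa(plane, k=5))
```

The median is used rather than the mean because the local estimates are heavy-tailed: a few neighborhoods with R_2k close to R_k produce very large values that would dominate an average.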
-
How to find a unicorn: a novel model-free, unsupervised anomaly detection method for time series
Authors:
Zsigmond Benkő,
Tamás Bábel,
Zoltán Somogyvári
Abstract:
Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called "unicorn" or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF) to measure the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall out from the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord algorithms even in recognizing traditional outliers and it also recognized unique events that those did not. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully recognized unique events in those cases where they were already known such as the gravitational waves of a binary black hole merger on LIGO detector data and the signs of respiratory failure on ECG data series. Furthermore, unique events were found on the LIBOR data set of the last 30 years.
Submitted 15 June, 2021; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Prediction of Emerging Technologies Based on Analysis of the U.S. Patent Citation Network
Authors:
Péter Érdi,
Kinga Makovi,
Zoltán Somogyvári,
Katherine Strandburg,
Jan Tobochnik,
Péter Volf,
László Zalányi
Abstract:
The network of patents connected by citations is an evolving graph, which provides a representation of the innovation process. A patent citing another implies that the cited patent reflects a piece of previously existing knowledge that the citing patent builds upon. The methodology presented here (i) identifies actual clusters of patents, i.e. technological branches, and (ii) gives predictions about the temporal changes of the structure of the clusters. A predictor called the citation vector is defined to characterize technological development by showing how a patent cited by other patents belongs to various industrial fields. The clustering technique adopted is able to detect newly emerging recombinations and predicts emerging technology clusters. The predictive ability of our new method is illustrated on the example of USPTO subcategory 11, Agriculture, Food, Textiles. A cluster of patents is determined based on citation data up to 1991, which shows significant overlap with class 442, formed at the beginning of 1997. These new tools of predictive analytics could support policy decision-making processes in science and technology, and help formulate recommendations for action.
Submitted 4 April, 2013; v1 submitted 18 June, 2012;
originally announced June 2012.
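One plausible reading of the citation vector described in the abstract above is a normalized histogram of the fields of the patents that cite a given patent. The sketch below illustrates that reading only. The normalization, the function name, and the category labels are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def citation_vector(citing_categories, categories):
    """Share of incoming citations received from each category.

    citing_categories: category label of every patent citing the target
                       patent (hypothetical input format).
    categories:        fixed, ordered list of all category labels.
    """
    counts = Counter(citing_categories)
    total = sum(counts[c] for c in categories)
    if total == 0:
        # An uncited patent gets an all-zero vector.
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


if __name__ == "__main__":
    # Hypothetical labels; a real analysis would use USPTO subcategories.
    fields = ["agriculture", "chemistry", "textiles"]
    citing = ["textiles", "textiles", "chemistry"]
    print(citation_vector(citing, fields))
```

Patents whose vectors are close to each other are cited by a similar mix of fields, which is the kind of similarity a clustering step could then operate on.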