-
Feature space reduction method for ultrahigh-dimensional, multiclass data: Random forest-based multiround screening (RFMS)
Authors:
Gergely Hanczár,
Marcell Stippinger,
Dávid Hanák,
Marcell T. Kurbucz,
Olivér M. Törteli,
Ágnes Chripkó,
Zoltán Somogyvári
Abstract:
In recent years, numerous screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features; however, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data give rise to this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and selection of features based on their importance. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, RFMS is on par with industry-standard feature screening methods while also offering many advantages over them.
Submitted 25 May, 2023;
originally announced May 2023.
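The multiround loop described in the RFMS abstract (shuffle the features, split them into small subsets, score each subset with a partial model, keep the best of each subset, repeat) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `separation_score` and `multiround_screen`, the parameters `subset_size` and `keep_fraction`, and the toy data are all hypothetical, and the partial model is replaced by a simple class-separation score so the example runs with the standard library alone. In RFMS itself the partial models are random forests.

```python
import random
from statistics import mean, pvariance


def separation_score(column, labels):
    """Stand-in importance (NOT a random forest): variance of the class
    means divided by the average within-class variance."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    between = pvariance([mean(g) for g in groups.values()])
    within = mean(pvariance(g) for g in groups.values())
    return between / (within + 1e-12)


def multiround_screen(X, y, n_keep, subset_size=4, keep_fraction=0.5, seed=0):
    """Return indices of the features that survive tournament-style rounds."""
    rng = random.Random(seed)
    score = lambda j: separation_score([row[j] for row in X], y)
    survivors = list(range(len(X[0])))
    while len(survivors) > n_keep:
        rng.shuffle(survivors)  # random assignment of features to subsets
        winners = []
        for start in range(0, len(survivors), subset_size):
            subset = survivors[start:start + subset_size]
            n_win = max(1, int(len(subset) * keep_fraction))
            winners.extend(sorted(subset, key=score, reverse=True)[:n_win])
        # Stop if a round would overshoot the target or make no progress.
        if len(winners) < n_keep or len(winners) == len(survivors):
            break
        survivors = winners
    # Final ranking among the remaining features.
    return sorted(survivors, key=score, reverse=True)[:n_keep]


# Toy data: 3 classes, 16 features, only features 3 and 11 carry class signal.
rng = random.Random(1)
y = [i % 3 for i in range(60)]
X = [[5.0 * label + rng.gauss(0, 0.1) if j in (3, 11) else rng.gauss(0, 1)
      for j in range(16)] for label in y]
selected = multiround_screen(X, y, n_keep=2)
```

Because each partial score only sees one small subset at a time, memory use per round depends on `subset_size` rather than on the full feature count, which is the point of splitting the feature space in the first place.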
-
BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space
Authors:
Marcell Stippinger,
Dávid Hanák,
Marcell T. Kurbucz,
Gergely Hanczár,
Olivér M. Törteli,
Zoltán Somogyvári
Abstract:
The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the overall usefulness and the intercorrelations of blended features can be controlled by the user, so that the synthetic feature space can imitate the key properties of a real biometric dataset.
Submitted 25 April, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Manifold-adaptive dimension estimation revisited
Authors:
Zsigmond Benkő,
Marcell Stippinger,
Roberta Rehus,
Attila Bencze,
Dániel Fabó,
Boglárka Hajnal,
Loránd Erőss,
András Telcs,
Zoltán Somogyvári
Abstract:
Data dimensionality informs us about data complexity and sets limits on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold-adaptive Farahmand-Szepesvári-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates, assuming that the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality, assuming that the samples are i.i.d. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected-median-FSA estimator with kNN estimators: maximum likelihood (ML, Levina-Bickel) and two implementations of DANCo (R and MATLAB). We show that the corrected-median-FSA estimator beats the ML estimator and is on equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics during resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.
Submitted 10 August, 2020; v1 submitted 7 August, 2020;
originally announced August 2020.
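The median-of-local-estimates idea from the abstract above can be illustrated with a small sketch. It assumes the local FSA estimate in the form commonly written as ln 2 / ln(R_2k / R_k), where R_k is the distance to the k-th nearest neighbour. The abstract itself does not spell the formula out, and the exponential correction for edge and finite-sample effects is not included. The function name `local_fsa_estimates`, the brute-force neighbour search and the toy dataset are illustrative choices only.

```python
import math
import random
from statistics import median


def local_fsa_estimates(points, k):
    """Local estimates ln 2 / ln(R_2k / R_k) from brute-force neighbour distances."""
    estimates = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_k, r_2k = dists[k - 1], dists[2 * k - 1]
        if r_2k > r_k > 0:  # skip degenerate ties
            estimates.append(math.log(2) / math.log(r_2k / r_k))
    return estimates


# Toy data: a 2-dimensional plane embedded linearly in 3-dimensional space.
rng = random.Random(0)
points = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    points.append((u, v, u + v))

# Global estimate: the median of the local estimates (uncorrected).
dimension = median(local_fsa_estimates(points, k=5))
```

On this toy set the uncorrected median should land near the intrinsic dimension of 2 even though the points live in three coordinates. Points close to the boundary of the sampled region are expected to pull it slightly away from 2, which is the kind of edge effect the paper's correction formula targets.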
-
Complete Inference of Causal Relations between Dynamical Systems
Authors:
Zsigmond Benkő,
Ádám Zlatniczki,
Marcell Stippinger,
Dániel Fabó,
András Sólyom,
Loránd Erőss,
András Telcs,
Zoltán Somogyvári
Abstract:
From ancient philosophers to modern economists, biologists, and other researchers, there has been a continuous effort to unveil causal relations. The most formidable challenge lies in deducing the nature of the causal relationship: whether it is unidirectional, bidirectional, or merely apparent, that is, implied by an unobserved common cause.
While modern technology equips us with tools to collect data from intricate systems such as the planet's ecosystem or the human brain, comprehending their functioning requires the identification and differentiation of causal relationships among the components, all without external interventions.
In this context, we introduce a novel method capable of distinguishing and assigning probabilities to the presence of all potential causal relations between two or more time series within dynamical systems. The efficacy of this method is verified using synthetic datasets and applied to EEG (electroencephalographic) data recorded from epileptic patients.
Given the universal applicability of our method, it holds promise for diverse scientific fields.
Submitted 17 January, 2024; v1 submitted 31 August, 2018;
originally announced August 2018.
-
Universality and scaling laws in the cascading failure model with healing
Authors:
Marcell Stippinger,
János Kertész
Abstract:
Cascading failures may lead to dramatic collapse in interdependent networks, where the breakdown takes place as a discontinuity of the order parameter. In the cascading failure (CF) model with healing there is a control parameter which at some value suppresses the discontinuity of the order parameter. However, up to this value of the healing parameter the breakdown is a hybrid transition, meaning that, besides this first-order character, the transition shows scaling too. In this paper we investigate the question of universality related to the scaling behavior. Recently we showed that the hybrid phase transition in the original CF model has two sets of exponents describing, respectively, the order parameter and the cascade statistics, which are connected by a scaling law. In the CF model with healing we measure these exponents as a function of the healing parameter. We find two universality classes: in the wide range below the critical healing value the exponents agree with those of the original model, while above this value the model displays trivial scaling, meaning that fluctuations follow the central limit theorem.
Submitted 22 December, 2017; v1 submitted 27 May, 2017;
originally announced May 2017.
-
Hybrid Phase Transition into an Absorbing State: Percolation and Avalanches
Authors:
Deokjae Lee,
S. Choi,
M. Stippinger,
J. Kertész,
B. Kahng
Abstract:
Interdependent networks are more fragile under random attacks than simplex networks, because interlayer dependencies lead to cascading failures and finally to a sudden collapse. This is a hybrid phase transition (HPT), meaning that at the transition point the order parameter has a jump but there are also critical phenomena related to it. Here we study these phenomena on the Erdős-Rényi and the two-dimensional interdependent networks and show that the hybrid percolation transition exhibits two kinds of critical behaviors: divergence of the fluctuations of the order parameter and power-law size distribution of finite avalanches at a transition point. At the transition point, avalanches of infinite size occur, thus the avalanche statistics also have the nature of an HPT. The exponent $β_m$ of the order parameter is $1/2$ under general conditions, while the value of the exponent $γ_m$ characterizing the fluctuations of the order parameter depends on the system. The critical behavior of the finite avalanches can be described by another set of exponents, $β_a$ and $γ_a$. These two critical behaviors are coupled by a scaling law: $1-β_m=γ_a$.
Submitted 11 April, 2016; v1 submitted 28 December, 2015;
originally announced December 2015.
-
Enhancing resilience of interdependent networks by healing
Authors:
Marcell Stippinger,
János Kertész
Abstract:
Interdependent networks are characterized by two kinds of interactions: the usual connectivity links within each network and the dependency links coupling nodes of different networks. Due to the latter links, such networks are known to suffer from cascading failures and catastrophic breakdowns. When modeling these phenomena, one usually assumes that a fraction of nodes gets damaged in one of the networks, possibly followed by a cascade of failures. In real life the initiating failures do not occur at once, and effort is made to replace the ties eliminated due to the failing nodes. Here we study a dynamic extension of the model of interdependent networks and introduce the possibility of link formation with a probability w, called healing, to bridge non-functioning nodes and enhance network resilience. A single random node is removed, which may initiate an avalanche. After each removal step, healing sets in, resulting in a new topology. Then a new node fails and the process continues until the giant component disappears either in a catastrophic breakdown or in a smooth transition. Simulation results are presented for square lattices as starting networks under random attacks of constant intensity. We find that the shift in the position of the breakdown has a power-law scaling as a function of the healing probability with an exponent close to 1. Below a critical healing probability, catastrophic cascades form and the average degree of surviving nodes decreases monotonically, while above this value there are no macroscopic cascades and the average degree first increases and decreases only at the very late stage of the process. These findings help to plan interventions in crisis situations by describing the efficiency of healing efforts needed to suppress cascading failures.
Submitted 4 September, 2014; v1 submitted 6 December, 2013;
originally announced December 2013.
-
Analytic results and weighted Monte Carlo simulations for CDO pricing
Authors:
Marcell Stippinger,
Bálint Vető,
Éva Rácz,
Zsolt Bihary
Abstract:
We explore the possibilities of importance sampling in the Monte Carlo pricing of a structured credit derivative referred to as Collateralized Debt Obligation (CDO). Modeling a CDO contract is challenging: since it depends on a pool of (typically about 100) assets, Monte Carlo simulations are often the only feasible approach to pricing. Variance reduction techniques are therefore of great importance. This paper presents an exact analytic solution using the Laplace transform and MC importance sampling results for an easily tractable intensity-based model of the CDO, namely the compound Poissonian. Furthermore, analytic formulae are derived for the reweighting efficiency. The computational gain is appealing; nevertheless, even in this basic scheme, a phase transition can be found, rendering some parameter regimes out of reach. A model-independent transform approach is also presented for CDO pricing.
Submitted 26 May, 2011;
originally announced May 2011.
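The reweighting idea behind importance sampling can be illustrated on a deliberately simplified toy, which is not the paper's model. Assume the number of defaults is a plain Poisson variable with intensity `rate`, each default causes a unit loss, and the quantity of interest is the expected loss of a tranche between hypothetical attachment and detachment points. Sampling is done at a higher `tilted_rate`, and each sample is reweighted by the Poisson likelihood ratio. All function names and parameter values below are assumptions made for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def tranche_payoff(n_defaults, attach, detach):
    """Loss absorbed by the tranche when each default costs one unit."""
    return min(max(n_defaults - attach, 0), detach - attach)


def tranche_loss_is(rate, tilted_rate, attach, detach, n_samples, seed=0):
    """Importance-sampling estimate: draw from the tilted intensity and
    reweight by the ratio of the original to the tilted Poisson pmf."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        n = sample_poisson(tilted_rate, rng)
        weight = math.exp(tilted_rate - rate) * (rate / tilted_rate) ** n
        total += tranche_payoff(n, attach, detach) * weight
    return total / n_samples


def tranche_loss_exact(rate, attach, detach, n_max=200):
    """Reference value from a direct sum over the Poisson pmf."""
    total, pmf = 0.0, math.exp(-rate)
    for n in range(n_max):
        total += tranche_payoff(n, attach, detach) * pmf
        pmf *= rate / (n + 1)
    return total


estimate = tranche_loss_is(rate=5.0, tilted_rate=13.0, attach=12, detach=16,
                           n_samples=20000)
exact = tranche_loss_exact(rate=5.0, attach=12, detach=16)
```

With the original intensity, few samples would reach the tranche at all. Sampling at the tilted intensity places most draws where the payoff is non-zero, and the weights restore the correct expectation. The choice of `tilted_rate` matters: a poorly chosen tilt can inflate the variance instead of reducing it.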