-
Delta-Closure Structure for Studying Data Distribution
Authors:
Aleksey Buzmakov,
Tatiana Makhalova,
Sergei O. Kuznetsov,
Amedeo Napoli
Abstract:
In this paper, we revisit pattern mining and study the distribution underlying a binary dataset thanks to the closure structure which is based on passkeys, i.e., minimum generators in equivalence classes robust to noise. We introduce $Δ$-closedness, a generalization of the closure operator, where $Δ$ measures how a closed set differs from its upper neighbors in the partial order induced by closure. A $Δ$-class of equivalence includes minimum and maximum elements and allows us to characterize the distribution underlying the data. Moreover, the set of $Δ$-classes of equivalence can be partitioned into the so-called $Δ$-closure structure. In particular, a $Δ$-class of equivalence with a high level demonstrates correlations among many attributes, which are supported by more observations when $Δ$ is large. In the experiments, we study the $Δ$-closure structure of several real-world datasets and show that this structure is very stable for large $Δ$ and does not substantially depend on the data sampling used for the analysis.
Submitted 13 October, 2022;
originally announced October 2022.
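As an illustration of the idea in the abstract above, here is a minimal sketch of a closure operator and a Δ-style measure on a toy binary dataset. The dataset, the function names, and the exact definition of `delta` (the smallest support drop when one more attribute is added, with the support itself used for the full attribute set) are assumptions made for this example; the paper's formal definitions may differ.

```python
# Toy binary dataset (hypothetical): each row is the attribute set of one object.
DATA = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b"},
    {"a"},
    {"c"},
]
ITEMS = sorted(set().union(*DATA))


def extent(itemset):
    """Indices of the objects that contain every attribute of `itemset`."""
    return {i for i, row in enumerate(DATA) if itemset <= row}


def closure(itemset):
    """Largest attribute set shared by all objects in the extent of `itemset`."""
    rows = [DATA[i] for i in extent(itemset)]
    return set(ITEMS) if not rows else set.intersection(*rows)


def delta(itemset):
    """Smallest support drop caused by adding one more attribute (assumed definition)."""
    support = len(extent(itemset))
    drops = [support - len(extent(itemset | {m})) for m in ITEMS if m not in itemset]
    # Convention assumed here: the full attribute set gets its own support as delta.
    return min(drops) if drops else support


def is_delta_closed(itemset, d):
    """True if every one-attribute extension loses at least `d` objects."""
    return delta(itemset) >= d
```

On this toy data, `{"b"}` is not closed (its closure is `{"a", "b"}`), `{"a"}` is closed but only with a margin of one object, and `{"a", "b"}` remains closed for Δ = 2. Under the assumed definition, a larger Δ keeps only itemsets whose closedness is backed by more observations.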
-
Experimental Study of Concise Representations of Concepts and Dependencies
Authors:
Aleksey Buzmakov,
Egor Dudyrev,
Sergei O. Kuznetsov,
Tatiana Makhalova,
Amedeo Napoli
Abstract:
In this paper we study concise representations of concepts and dependencies, i.e., implications and association rules. Such representations are based on equivalence classes and their elements, i.e., minimal generators, minimum generators including keys and passkeys, proper premises, and pseudo-intents. All these sets of attributes are significant and well studied from the computational point of view, while their statistical properties remain to be studied. The purpose of this paper is to study these singular attribute sets and, in parallel, to study how to evaluate the complexity of a dataset from an FCA point of view. We analyze the empirical distributions and the sizes of these particular attribute sets. In addition, we propose several measures of data complexity, such as distributivity, linearity, size of concepts, and size of minimum generators, for the analysis of real-world and synthetic datasets.
Submitted 24 November, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Discovery data topology with the closure structure. Theoretical and practical aspects
Authors:
Tatiana Makhalova,
Aleksey Buzmakov,
Sergei O. Kuznetsov,
Amedeo Napoli
Abstract:
In this paper, we revisit pattern mining, and especially itemset mining, which allows one to analyze binary datasets in search of interesting and meaningful association rules and respective itemsets in an unsupervised way. While a summarization of a dataset based on a set of patterns does not provide a general and satisfying view over a dataset, we introduce a concise representation -- the closure structure -- based on closed itemsets and their minimum generators, for capturing the intrinsic content of a dataset. The closure structure allows one to understand the topology of the dataset as a whole and the inherent complexity of the data. We propose a formalization of the closure structure in terms of Formal Concept Analysis, which is well adapted to study this data topology. We present and demonstrate theoretical results, as well as practical results obtained using the GDPM algorithm. GDPM is rather unique in its functionality as it returns a characterization of the topology of a dataset in terms of complexity levels, highlighting the diversity and the distribution of the itemsets. Finally, a series of experiments shows how GDPM can be practically used and what can be expected from the output.
Submitted 30 March, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
-
The Comparison of Methods for Individual Treatment Effect Detection
Authors:
Aleksey Buzmakov,
Daria Semenova,
Maria Temirkaeva
Abstract:
Today, treatment effect estimation at the individual level is a vital problem in many areas of science and business. For example, in marketing, estimates of the treatment effect are used to select the most efficient promo-mechanics; in medicine, individual treatment effects are used to determine the optimal dose of medication for each patient, and so on. At the same time, the question of choosing the best method, i.e., the method that ensures the smallest predictive error (for instance, RMSE) or the highest total (average) value of the effect, remains open. Accordingly, in this paper we compare the effectiveness of machine learning methods for the estimation of individual treatment effects. The comparison is performed on the Criteo Uplift Modeling Dataset. We show that the combination of the Logistic Regression method and the Difference Score method, as well as the Uplift Random Forest method, yields the most accurate Individual Treatment Effect predictions on the top 30% of observations of the test dataset.
Submitted 3 December, 2019;
originally announced December 2019.
-
Segmentation Criteria in the Problem of Porosity Determination based on CT Scans
Authors:
V. Kokhan,
M. Grigoriev,
A. Buzmakov,
V. Uvarov,
A. Ingacheva,
E. Shvets,
M. Chukalina
Abstract:
Porous materials are widely used in different applications; in particular, they are used to create various filters. Their quality depends on parameters that characterize the internal structure, such as porosity, permeability and so on. Computed tomography (CT) allows one to see the internal structure of a porous object without destroying it. The result of tomography is a grayscale image. To evaluate the desired parameters, the image should be segmented. Traditional intensity threshold approaches do not reliably produce correct results due to limitations in CT image quality. Errors in the evaluation of characteristics of porous materials based on segmented images can lead to the incorrect estimation of their quality and consequently to the impossibility of exploitation, financial losses and even to accidents. It is difficult to perform segmentation correctly due to the strong difference in voxel intensities of the reconstructed object and the presence of noise. Image filtering as a preprocessing procedure is used to improve the quality of segmentation. Nevertheless, there is a problem of choosing an optimal filter. In this work, a method for selecting an optimal filter based on an attributive indicator of porous objects (they should be free from 'levitating stones' inside pores) is proposed. We use real data where beam hardening artifacts are removed, which allows us to focus on the noise reduction process.
Submitted 16 October, 2019;
originally announced October 2019.
-
Machine learning for subgroup discovery under treatment effect
Authors:
Aleksey Buzmakov
Abstract:
In many practical tasks, one needs to estimate the effect of a treatment at the individual level. For example, in medicine it is essential to determine the patients that would benefit from a certain medication. In marketing, knowing the persons that are likely to buy a new product would reduce the amount of spam. In this chapter, we review the methods to estimate an individual treatment effect from a randomized trial, i.e., an experiment in which some individuals receive a new treatment while the others do not. Finally, it is shown that new efficient methods are needed in this domain.
Submitted 26 February, 2019;
originally announced February 2019.
-
Mining Best Closed Itemsets for Projection-antimonotonic Constraints in Polynomial Time
Authors:
Aleksey Buzmakov,
Sergei O. Kuznetsov,
Amedeo Napoli
Abstract:
The exponential explosion of the set of patterns is one of the main challenges in pattern mining. This challenge is approached by introducing a constraint for pattern selection. One of the first constraints proposed in pattern mining is support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all its superpatterns are not frequent. However, many other constraints for pattern selection are neither monotonic nor anti-monotonic, which makes it difficult to generate patterns satisfying these constraints.
In order to deal with nonmonotonic constraints, we introduce the notion of "projection antimonotonicity" and the SOFIA algorithm, which allow generating the best patterns for a class of nonmonotonic constraints. Cosine interest, robustness, stability of closed itemsets, and the associated delta-measure are among these constraints. SOFIA starts from light descriptions of transactions in a dataset (a small set of items in the case of itemset description) and then iteratively adds more information to these descriptions (more items with an indication of the tidsets they describe).
Submitted 28 March, 2017;
originally announced March 2017.
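The anti-monotonicity of support mentioned in the abstract above can be illustrated with a short level-wise search. This is an Apriori-style sketch, not the SOFIA algorithm; the transactions and function names are hypothetical and chosen only for the example.

```python
from itertools import combinations

# Hypothetical transactions; each one is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def frequent_itemsets(min_support):
    """Level-wise search that prunes candidates using anti-monotonicity of support."""
    items = sorted(set().union(*TRANSACTIONS))
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = list(level)
    while level:
        frequent = set(level)
        # Join frequent itemsets of size k into candidates of size k + 1.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # A candidate can be frequent only if all its one-item-smaller subsets are.
        level = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
            and support(c) >= min_support
        ]
        result.extend(level)
    return result
```

With `min_support=2`, the candidate `{"bread", "milk", "butter"}` is discarded without counting its support, because its subset `{"milk", "butter"}` is already infrequent. Constraints that are neither monotonic nor anti-monotonic do not permit this kind of pruning directly, which is the difficulty the abstract refers to.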
-
High-resolution investigation of spinal cord and spine
Authors:
I. Bukreeva,
V. Asadchikov,
A. Buzmakov,
V. Grigoryev,
A. Bravin,
A. Cedola
Abstract:
High-resolution non-invasive 3D study of intact spine and spinal cord morphology at the level of complex vascular and neuronal organization is a crucial issue for the development of treatments for injuries and pathologies of the central nervous system (CNS). X-ray phase contrast tomography enables high-quality 3D visualization, in an ex-vivo mouse model, of both the vascular and neuronal networks of the soft spinal cord tissue at scales from millimeters to hundreds of nanometers without any contrast agents or sectioning. Until now, high-resolution 3D visualization of the spinal cord has mostly been limited to imaging of the organ extracted from the vertebral column, because highly absorbing bony tissue drastically reduces the morphological details of soft tissue in the image. However, the extremely destructive procedure of bone removal leads to sample deterioration and, therefore, to the loss of a considerable part of the information about the object. In this work we present a data analysis procedure for obtaining high-resolution, high-contrast 3D images of intact mouse spinal cord surrounded by vertebrae, preserving the full richness of micro-details of the spinal cord inside. Our results are a first step on the difficult path toward high-resolution investigation of the central nervous system in in-vivo models.
Submitted 27 February, 2017;
originally announced February 2017.
-
SIMEX: Simulation of Experiments at Advanced Light Sources
Authors:
C Fortmann-Grote,
A A Andreev,
R Briggs,
M Bussmann,
A Buzmakov,
M Garten,
A Grund,
A Hübl,
S Hauff,
A Joy,
Z Jurek,
N D Loh,
T Rüter,
L Samoylova,
R Santra,
E A Schneidmiller,
A Sharma,
M Wing,
S Yakubov,
C H Yoon,
M V Yurkov,
B Ziaja,
A P Mancuso
Abstract:
Realistic simulations of experiments at large scale photon facilities, such as optical laser laboratories, synchrotrons, and free electron lasers, are of vital importance for the successful preparation, execution, and analysis of these experiments investigating ever more complex physical systems, e.g. biomolecules, complex materials, and ultra-short lived states of highly excited matter. Traditional photon science modelling takes into account only isolated aspects of an experiment, such as the beam propagation, the photon-matter interaction, or the scattering process, making idealized assumptions about the remaining parts, e.g. the source spectrum, temporal structure and coherence properties of the photon beam, or the detector response. In SIMEX, we have implemented a platform for complete start-to-end simulations, following the radiation from the source, through the beam transport optics to the sample or target under investigation, its interaction with and scattering from the sample, and its registration in a photon detector, including a realistic model of the detector response to the radiation. Data analysis tools can be hooked up to the modelling pipeline easily. This allows researchers and facility operators to simulate their experiments and instruments in real life scenarios, identify promising and unattainable regions of the parameter space and ultimately make better use of valuable beamtime.
This work is licensed under the Creative Commons Attribution 3.0 Unported License: http://creativecommons.org/licenses/by/3.0/.
Submitted 17 November, 2016; v1 submitted 19 October, 2016;
originally announced October 2016.
-
Revisiting Pattern Structure Projections
Authors:
Aleksey Buzmakov,
Sergei O. Kuznetsov,
Amedeo Napoli
Abstract:
Formal concept analysis (FCA) is a well-founded method for data analysis and has many applications in data mining. Pattern structures are an extension of FCA for dealing with complex data such as sequences or graphs. However, the computational complexity of computing with pattern structures is high, and projections of pattern structures were introduced for simplifying computation. In this paper we introduce o-projections of pattern structures, a generalization of projections which defines a wider class of projections preserving the properties of the original approach. Moreover, we show that o-projections form a semilattice and we discuss the correspondence between o-projections and the representation contexts of o-projected pattern structures.
KEYWORDS: formal concept analysis, pattern structures, representation contexts, projections
Submitted 16 June, 2015;
originally announced June 2015.
-
Fast Generation of Best Interval Patterns for Nonmonotonic Constraints
Authors:
Aleksey Buzmakov,
Sergei O. Kuznetsov,
Amedeo Napoli
Abstract:
In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all its superpatterns are not frequent. However, many other constraints for pattern selection are not (anti-)monotonic, which makes it difficult to generate patterns satisfying these constraints. In this paper we introduce the notion of projection-antimonotonicity and the $θ$-$Σøφια$ algorithm, which allows efficient generation of the best patterns for some nonmonotonic constraints. We consider stability and the $Δ$-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute the best interval tuple patterns w.r.t. these measures and show the advantage of our approach over postfiltering approaches.
KEYWORDS: Pattern mining, nonmonotonic constraints, interval tuple data
Submitted 16 June, 2015; v1 submitted 2 June, 2015;
originally announced June 2015.
-
On mining complex sequential data by means of FCA and pattern structures
Authors:
Aleksey Buzmakov,
Elias Egho,
Nicolas Jay,
Sergei O. Kuznetsov,
Amedeo Napoli,
Chedy Raïssi
Abstract:
Nowadays, data sets are available in very complex and heterogeneous forms. Mining of such data collections is essential to support many real-world applications ranging from healthcare to marketing. In this work, we focus on the analysis of "complex" sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of Formal Concept Analysis (FCA) and its extension based on "pattern structures". Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures, along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computing efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analyzing interesting patient patterns from a French healthcare data set on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported in this use case, which is the main motivation for this work.
Keywords: data mining; formal concept analysis; pattern structures; projections; sequences; sequential data.
Submitted 9 April, 2015;
originally announced April 2015.