-
Optimal Categorical Attribute Transformation for Granularity Change in Relational Databases for Binary Decision Problems in Educational Data Mining
Authors:
Paulo J. L. Adeodato,
Fábio C. Pereira,
Rosalvo F. Oliveira Neto
Abstract:
This paper presents an approach for transforming data granularity in hierarchical databases for binary decision problems by applying regression to categorical attributes at the lower grain levels. Attributes from a lower hierarchy entity in the relational database have their information content optimized through regression on the categories histogram trained on a small exclusive labelled sample, i…
▽ More
This paper presents an approach for transforming data granularity in hierarchical databases for binary decision problems by applying regression to categorical attributes at the lower grain levels. Attributes from a lower hierarchy entity in the relational database have their information content optimized through regression on the categories histogram trained on a small exclusive labelled sample, instead of the usual mode category of the distribution. The paper validates the approach on a binary decision task for assessing the quality of secondary schools focusing on how logistic regression transforms the students and teachers attributes into school attributes. Experiments were carried out on Brazilian schools public datasets via 10-fold cross-validation comparison of the ranking score produced also by logistic regression. The proposed approach achieved higher performance than the usual distribution mode transformation and equal to the expert weighing approach measured by the maximum Kolmogorov-Smirnov distance and the area under the ROC curve at 0.01 significance level.
△ Less
Submitted 28 February, 2017;
originally announced February 2017.
-
On the equivalence between Kolmogorov-Smirnov and ROC curve metrics for binary classification
Authors:
Paulo J. L. Adeodato,
Sílvio B. Melo
Abstract:
Binary decisions are very common in artificial intelligence. Applying a threshold on the continuous score gives the human decider the power to control the operating point to separate the two classes. The classifier,s discriminating power is measured along the continuous range of the score by the Area Under the ROC curve (AUC_ROC) in most application fields. Only finances uses the poor single point…
▽ More
Binary decisions are very common in artificial intelligence. Applying a threshold on the continuous score gives the human decider the power to control the operating point to separate the two classes. The classifier,s discriminating power is measured along the continuous range of the score by the Area Under the ROC curve (AUC_ROC) in most application fields. Only finances uses the poor single point metric maximum Kolmogorov-Smirnov (KS) distance. This paper proposes the Area Under the KS curve (AUC_KS) for performance assessment and proves AUC_ROC = 0.5 + AUC_KS, as a simpler way to calculate the AUC_ROC. That is even more important for ROC averaging in ensembles of classifiers or n fold cross-validation. The proof is geometrically inspired on rotating all KS curve to make it lie on the top of the ROC chance diagonal. On the practical side, the independent variable on the abscissa on the KS curve simplifies the calculation of the AUC_ROC. On the theoretical side, this research gives insights on probabilistic interpretations of classifiers assessment and integrates the existing body of knowledge of the information theoretical ROC approach with the proposed statistical approach based on the thoroughly known KS distribution.
△ Less
Submitted 1 June, 2016;
originally announced June 2016.
-
A Data Mining Approach to Solve the Goal Scoring Problem
Authors:
Renato Oliveira,
Paulo Adeodato,
Arthur Carvalho,
Icamaan Viegas,
Christian Diego,
Tsang Ing-Ren
Abstract:
In soccer, scoring goals is a fundamental objective which depends on many conditions and constraints. Considering the RoboCup soccer 2D-simulator, this paper presents a data mining-based decision system to identify the best time and direction to kick the ball towards the goal to maximize the overall chances of scoring during a simulated soccer match. Following the CRISP-DM methodology, data for mo…
▽ More
In soccer, scoring goals is a fundamental objective which depends on many conditions and constraints. Considering the RoboCup soccer 2D-simulator, this paper presents a data mining-based decision system to identify the best time and direction to kick the ball towards the goal to maximize the overall chances of scoring during a simulated soccer match. Following the CRISP-DM methodology, data for modeling were extracted from matches of major international tournaments (10691 kicks), knowledge about soccer was embedded via transformation of variables and a Multilayer Perceptron was used to estimate the scoring chance. Experimental performance assessment to compare this approach against previous LDA-based approach was conducted from 100 matches. Several statistical metrics were used to analyze the performance of the system and the results showed an increase of 7.7% in the number of kicks, producing an overall increase of 78% in the number of goals scored.
△ Less
Submitted 26 June, 2013; v1 submitted 21 May, 2013;
originally announced May 2013.