Feature Synergy, Redundancy, and Independence in Global Model Explanations using SHAP Vector Decomposition

Authors: Jan Ittner, Lukasz Bolikowski, Konstantin Hemker, Ricardo Kennedy

Abstract: We offer a new formalism for global explanations of pairwise feature dependencies and interactions in supervised models. Building upon SHAP values and SHAP interaction values, our approach decomposes feature contributions into synergistic, redundant and independent components (S-R-I decomposition of SHAP vectors). We propose a geometric interpretation of the components and formally prove its basic properties. Finally, we demonstrate the utility of synergy, redundancy and independence by applying them to a constructed data set and model.

Submitted 26 July, 2021; originally announced July 2021.

Comments: 7 pages, 2 figures

arXiv:1309.0326 [pdf, other]

doi 10.1007/978-3-319-08425-1_3

Tagging Scientific Publications using Wikipedia and Natural Language Processing Tools. Comparison on the ArXiv Dataset

Authors: Michał Łopuszyński, Łukasz Bolikowski

Abstract: In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As a first source of labels Wikipedia is employed, second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on the dataset consisting of abstracts from 0.7 million of scientific documents deposited in the ArXiv preprint collection. We believe that obtained tags can be later on applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).

Submitted 3 November, 2014; v1 submitted 2 September, 2013; originally announced September 2013.

Journal ref: Communications in Computer and Information Science Volume 416, Springer 2014, pp 16-27

arXiv:1303.6906 [pdf, ps, other]

Large scale citation matching using Apache Hadoop

Authors: Mateusz Fedoryszak, Dominika Tkaczyk, Łukasz Bolikowski

Abstract: During the process of citation matching links from bibliography entries to referenced publications are created. Such links are indicators of topical similarity between linked texts, are used in assessing the impact of the referenced document and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle great amounts of data using appropriate indexing and a MapReduce paradigm in the Hadoop environment.

Submitted 26 March, 2013; originally announced March 2013.

Comments: 11 pages, 4 figures

ACM Class: H.3.3

arXiv:1303.5367 [pdf, ps, other]

Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop

Authors: Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier, Lukasz Bolikowski

Abstract: Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys including classification, categorization and citation matching of scientific publications. The size of the input data classifies these algorithms in the range of big data problems, which can be efficiently solved on Hadoop clusters.

Submitted 16 March, 2014; v1 submitted 21 March, 2013; originally announced March 2013.

Comments: This paper (with changed content) appeared under the title "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem" in "Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation", "Studies in Computational Intelligence", Volume 541, 2014, http://link.springer.com/book/10.1007/978-3-319-04714-0

ACM Class: H.3.7

arXiv:1303.5234 [pdf, ps, other]

How to perform research in Hadoop environment not losing mental equilibrium - case study

Authors: Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier, Lukasz Bolikowski

Abstract: Conducting a research in an efficient, repetitive, evaluable, but also convenient (in terms of development) way has always been a challenge. To satisfy those requirements in a long term and simultaneously minimize costs of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines based on the research environment called Content Analysis System (CoAnSys) created in the Center for Open Science (CeON). Best practices and tools for working in the Apache Hadoop environment, as well as the process of establishing these rules are portrayed.

Submitted 16 March, 2014; v1 submitted 21 March, 2013; originally announced March 2013.

Comments: This paper (with changed content) appeared under the title "Chrum: The Tool for Convenient Generation of Apache Oozie Workflows" in "Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation", "Studies in Computational Intelligence", Volume 541, 2014, http://link.springer.com/book/10.1007/978-3-319-04714-0

ACM Class: H.3.7

arXiv:0904.0564 [pdf, ps, other]

Scale-free topology of the interlanguage links in Wikipedia

Authors: Łukasz Bolikowski

Abstract: The interlanguage links in Wikipedia connect pages on the same subject written in different languages. In theory, each connected component should be a clique and cover one topic. However, incoherent edits and obvious mistakes result in topic coalescence, yielding a non-trivial topology that is studied in this paper. We show that the component size distribution obeys the power law, and we explain anomalies in the distribution as results of certain edit conventions. Next, we propose a method of filtering out the cliques and study basic properties of the resulting skeleton, which turns out to be scale-free.

Submitted 6 April, 2009; v1 submitted 3 April, 2009; originally announced April 2009.

Comments: 4 pages, 4 figures, 1 table; minor mistakes corrected, some clarifications; submitted to Phys. Rev. E

Showing 1–6 of 6 results for author: Bolikowski, L