-
Reducing the climate impact of data portals: a case study
Authors:
Noah Gießing,
Madhurima Deb,
Ankit Satpute,
Moritz Schubotz,
Olaf Teschke
Abstract:
The carbon footprint share of the information and communication technology (ICT) sector has steadily increased in the past decade and is predicted to make up as much as 23% of global emissions in 2030. This shows a pressing need for developers, including the information retrieval community, to make their code more energy-efficient. In this project proposal, we discuss techniques to reduce the energy footprint of the MaRDI (Mathematical Research Data Initiative) Portal, a MediaWiki-based knowledge base. In future work, we plan to implement these changes and provide concrete measurements on the gain in energy efficiency. Researchers developing similar knowledge bases can adapt our measures to reduce their environmental footprint. In this way, we are working on mitigating the climate impact of Information Retrieval research.
Submitted 6 June, 2024;
originally announced June 2024.
-
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
Authors:
Ankit Satpute,
Noah Giessing,
Andre Greiner-Petter,
Moritz Schubotz,
Olaf Teschke,
Akiko Aizawa,
Bela Gipp
Abstract:
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange
Submitted 30 March, 2024;
originally announced April 2024.
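The abstract above reports its results as P@10 and nDCG. As a reading aid, the following is a minimal sketch of the textbook definitions of these two ranking metrics. It is not the authors' evaluation code: the function names, the use of graded gains, and the log2 discount are assumptions for illustration, and the paper's protocol (for example, how unjudged answers are treated) may differ.

```python
import math
from typing import Optional, Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the top-k ranked results that are relevant.

    `relevant` lists binary relevance judgments in ranked order.
    Missing positions (fewer than k results) count as non-relevant.
    """
    top = list(relevant)[:k]
    return sum(top) / k


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    # enumerate() is 0-based, so rank 1 corresponds to index 0 -> log2(2) = 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    `gains` lists graded relevance scores in ranked order (hypothetical
    grading scale; any non-negative numbers work).
    """
    ranked = list(gains) if k is None else list(gains)[:k]
    ideal = sorted(gains, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Under these definitions, a ranking with three relevant answers among its first ten has a P@10 of 0.3, and nDCG reaches 1.0 only when the answers appear in descending order of their graded relevance.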
-
Taxonomy of Mathematical Plagiarism
Authors:
Ankit Satpute,
Andre Greiner-Petter,
Noah Gießing,
Isabel Beckenbach,
Moritz Schubotz,
Olaf Teschke,
Akiko Aizawa,
Bela Gipp
Abstract:
Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. The outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism
Submitted 31 May, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
The extension of zbMATH Open by arXiv preprints
Authors:
Isabel Beckenbach,
Klaus Hulek,
Olaf Teschke
Abstract:
zbMATH Open has started a new feature -- relevant preprints posted at arXiv will also be displayed in the database. In this article we introduce this new feature and the underlying editorial policy. We also describe some of the technical issues involved and discuss the challenges this presents for future developments.
Submitted 16 January, 2024;
originally announced January 2024.
-
Bravo MaRDI: A Wikibase Powered Knowledge Graph on Mathematics
Authors:
Moritz Schubotz,
Eloi Ferrer,
Johannes Stegmüller,
Daniel Mietchen,
Olaf Teschke,
Larissa Pusch,
Tim OF Conrad
Abstract:
Mathematical world knowledge is a fundamental component of Wikidata. However, to date, no expertly curated knowledge graph has focused specifically on contemporary mathematics. Addressing this gap, the Mathematical Research Data Initiative (MaRDI) has developed a comprehensive knowledge graph that links multimodal research data in mathematics. This encompasses traditional research data items like datasets, software, and publications and includes semantically advanced objects such as mathematical formulas and hypotheses. This paper details the abilities of the MaRDI knowledge graph, which is based on Wikibase, leading up to its inaugural public release, codenamed Bravo, available on https://portal.mardi4nfdi.de.
Submitted 20 September, 2023;
originally announced September 2023.
-
TEIMMA: The First Content Reuse Annotator for Text, Images, and Math
Authors:
Ankit Satpute,
André Greiner-Petter,
Moritz Schubotz,
Norman Meuschke,
Akiko Aizawa,
Olaf Teschke,
Bela Gipp
Abstract:
This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair -- TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the obfuscation type to enable novel classifications for confirmed cases of plagiarism. It enables recording different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math.
Submitted 13 June, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
10 Years Later: The Mathematics Subject Classification and Linked Open Data
Authors:
Susanne Arndt,
Patrick Ion,
Mila Runnwerth,
Moritz Schubotz,
Olaf Teschke
Abstract:
Ten years ago, the Mathematics Subject Classification MSC 2010 was released, and a corresponding machine-readable Linked Open Data collection was published using the Simple Knowledge Organization System (SKOS). Now, the new MSC 2020 is out.
This paper recaps the last ten years of working on machine-readable MSC data and presents the new machine-readable MSC 2020. We describe the processing required to convert the version of record, as agreed by the editors of zbMATH and Mathematical Reviews, into the Linked Open Data form we call MSC2020-SKOS. The new form includes explicit marking of the changes from 2010 to 2020, some translations of English code descriptions into Chinese, Italian, and Russian, and extra material relating MSC to other mathematics classification efforts. We also outline future potential uses for MSC2020-SKOS in semantic indexing and sketch its embedding in a larger vision of scientific research data.
Submitted 2 August, 2021; v1 submitted 29 July, 2021;
originally announced July 2021.
-
zbMATH Open: API Solutions and Research Challenges
Authors:
Matteo Petrera,
Dennis Trautwein,
Isabel Beckenbach,
Dariush Ehsani,
Fabian Mueller,
Olaf Teschke,
Bela Gipp,
Moritz Schubotz
Abstract:
We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website https://zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. The API improves interoperability with other services, e.g., digital libraries, and allows using our data for research purposes. In this article, we
(1) illustrate the current and future overview of the services offered by zbMATH;
(2) present the initial version of the zbMATH links API;
(3) analyze potentials and limitations of the links API based on the example of the NIST Digital Library of Mathematical Functions;
(4) and finally, present the zbMATH Open dataset as a research resource and discuss connected open research problems.
Submitted 23 June, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
ARQMath Lab: An Incubator for Semantic Formula Search in zbMATH Open?
Authors:
Philipp Scharpf,
Moritz Schubotz,
Andre Greiner-Petter,
Malte Ostendorff,
Olaf Teschke,
Bela Gipp
Abstract:
The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entries in our database, we continuously investigate new approaches to satisfy the information needs of our users. We believe that the findings from the ARQMath evaluation will generate new insights into which index structures are most suitable to satisfy mathematical information needs. Search engines, recommender systems, plagiarism checking software, and many other added-value services acting on databases such as the arXiv and zbMATH need to combine natural and formula language. One initial approach to address this challenge is to enrich the mostly unstructured document data via Entity Linking. The ARQMath Task at CLEF 2020 aims to tackle the problem of linking newly posted questions from Math Stack Exchange (MSE) to existing ones that were already answered by the community. To deeply understand MSE information needs, answer types, and formula types, we performed manual runs for tasks 1 and 2. Furthermore, for task 2, we explored several formula retrieval methods, such as fuzzy string search, k-nearest neighbors, and our recently introduced approach to retrieve Mathematical Objects of Interest (MOI) with textual search queries. The task results show that neither our automated methods nor our manual runs achieved good scores in the competition. However, the perceived quality of the hits returned by the MOI search particularly motivates us to conduct further research about MOI.
Submitted 10 December, 2020; v1 submitted 4 December, 2020;
originally announced December 2020.
-
AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels
Authors:
Moritz Schubotz,
Philipp Scharpf,
Olaf Teschke,
Andreas Kuehnemund,
Corinna Breitinger,
Bela Gipp
Abstract:
Authors of research papers in mathematics and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital libraries in mathematics, as well as reviewing services such as zbMATH and Mathematical Reviews (MR), rely on these MSC labels in their workflows to organize the abstracting and reviewing process. In particular, the coarse-grained classification determines the subject editor who is responsible for the actual reviewing process.
In this paper, we investigate the feasibility of automatically assigning a coarse-grained primary classification using the MSC scheme, by regarding the problem as a multi-class classification machine learning task. We find that our method achieves an $F_1$-score of over 77%, which is remarkably close to the agreement of zbMATH and MR ($F_1$-score of 81%). Moreover, we find that the method's confidence score allows for reducing the effort by 86% compared to the manual coarse-grained classification effort while maintaining a precision of 81% for automatically classified articles.
Submitted 9 November, 2020; v1 submitted 25 May, 2020;
originally announced May 2020.
-
Mathematical Formulae in Wikimedia Projects 2020
Authors:
Moritz Schubotz,
André Greiner-Petter,
Norman Meuschke,
Olaf Teschke,
Bela Gipp
Abstract:
This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as coarse-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to further improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects.
Submitted 6 May, 2020; v1 submitted 20 March, 2020;
originally announced March 2020.
-
Forms of Plagiarism in Digital Mathematical Libraries
Authors:
Moritz Schubotz,
Olaf Teschke,
Vincent Stange,
Norman Meuschke,
Bela Gipp
Abstract:
We report on an exploratory analysis of the forms of plagiarism observable in mathematical publications, which we identified by investigating editorial notes from zbMATH. While most cases we encountered were simple copies of earlier work, we also identified several forms of disguised plagiarism. We investigated 11 cases in detail and evaluated how current plagiarism detection systems perform in identifying these cases. Moreover, we describe the steps required to discover these and potentially undiscovered cases in the future.
Submitted 9 September, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.