
Reducing the climate impact of data portals: a case study

Authors: Noah Gießing, Madhurima Deb, Ankit Satpute, Moritz Schubotz, Olaf Teschke

Abstract: The carbon footprint share of the information and communication technology (ICT) sector has steadily increased in the past decade and is predicted to make up as much as 23% of global emissions in 2030. This shows a pressing need for developers, including the information retrieval community, to make their code more energy-efficient. In this project proposal, we discuss techniques to reduce the energy footprint of the MaRDI (Mathematical Research Data Initiative) Portal, a MediaWiki-based knowledge base. In future work, we plan to implement these changes and provide concrete measurements on the gain in energy efficiency. Researchers developing similar knowledge bases can adapt our measures to reduce their environmental footprint. In this way, we are working on mitigating the climate impact of Information Retrieval research.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 4 pages

arXiv:2404.00344 [pdf, other]

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Authors: Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange

Submitted 30 March, 2024; originally announced April 2024.

Comments: Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), July 14–18, 2024, Washington, D.C., USA

arXiv:2401.16969 [pdf, ps, other]

doi 10.1007/978-3-031-56066-8_2

Taxonomy of Mathematical Plagiarism

Authors: Ankit Satpute, Andre Greiner-Petter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp

Abstract: Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. Outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism

Submitted 31 May, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: 46th European Conference on Information Retrieval (ECIR)

arXiv:2401.08297 [pdf, other]

The extension of zbMATH Open by arXiv preprints

Authors: Isabel Beckenbach, Klaus Hulek, Olaf Teschke

Abstract: zbMATH Open has started a new feature: relevant preprints posted at arXiv will also be displayed in the database. In this article we introduce this new feature and the underlying editorial policy. We also describe some of the technical issues involved and discuss the challenges this presents for future developments.

Submitted 16 January, 2024; originally announced January 2024.

MSC Class: 68V35

arXiv:2309.11484 [pdf, other]

Bravo MaRDI: A Wikibase Powered Knowledge Graph on Mathematics

Authors: Moritz Schubotz, Eloi Ferrer, Johannes Stegmüller, Daniel Mietchen, Olaf Teschke, Larissa Pusch, Tim OF Conrad

Abstract: Mathematical world knowledge is a fundamental component of Wikidata. However, to date, no expertly curated knowledge graph has focused specifically on contemporary mathematics. Addressing this gap, the Mathematical Research Data Initiative (MaRDI) has developed a comprehensive knowledge graph that links multimodal research data in mathematics. This encompasses traditional research data items like datasets, software, and publications and includes semantically advanced objects such as mathematical formulas and hypotheses. This paper details the abilities of the MaRDI knowledge graph, which is based on Wikibase, leading up to its inaugural public release, codenamed Bravo, available on https://portal.mardi4nfdi.de.

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at Wikidata'23: Wikidata workshop at ISWC 2023

arXiv:2305.13193 [pdf, other]

TEIMMA: The First Content Reuse Annotator for Text, Images, and Math

Authors: Ankit Satpute, André Greiner-Petter, Moritz Schubotz, Norman Meuschke, Akiko Aizawa, Olaf Teschke, Bela Gipp

Abstract: This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair: TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the obfuscation type to enable novel classifications for confirmed cases of plagiarism. It enables recording different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math.

Submitted 13 June, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2107.13877 [pdf, ps, other]

10 Years Later: The Mathematics Subject Classification and Linked Open Data

Authors: Susanne Arndt, Patrick Ion, Mila Runnwerth, Moritz Schubotz, Olaf Teschke

Abstract: Ten years ago, the Mathematics Subject Classification MSC 2010 was released, and a corresponding machine-readable Linked Open Data collection was published using the Simple Knowledge Organization System (SKOS). Now, the new MSC 2020 is out. This paper recaps the last ten years of working on machine-readable MSC data and presents the new machine-readable MSC 2020. We describe the processing required to convert the version of record, as agreed by the editors of zbMATH and Mathematical Reviews, into the Linked Open Data form we call MSC2020-SKOS. The new form includes explicit marking of the changes from 2010 to 2020, some translations of English code descriptions into Chinese, Italian, and Russian, and extra material relating MSC to other mathematics classification efforts. We also outline future potential uses for MSC2020-SKOS in semantic indexing and sketch its embedding in a larger vision of scientific research data.

Submitted 2 August, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

Comments: Extended version of the CICM article

MSC Class: 00-01 ACM Class: G.m; E.m

arXiv:2106.04664 [pdf, other]

zbMATH Open: API Solutions and Research Challenges

Authors: Matteo Petrera, Dennis Trautwein, Isabel Beckenbach, Dariush Ehsani, Fabian Mueller, Olaf Teschke, Bela Gipp, Moritz Schubotz

Abstract: We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website https://zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. The API improves interoperability with others, i.e., digital libraries, and allows using our data for research purposes. In this article, we (1) illustrate the current and future overview of the services offered by zbMATH; (2) present the initial version of the zbMATH links API; (3) analyze potentials and limitations of the links API based on the example of the NIST Digital Library of Mathematical Functions; and (4) finally present the zbMATH Open dataset as a research resource and discuss connected open research problems.

Submitted 23 June, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2012.02413 [pdf]

ARQMath Lab: An Incubator for Semantic Formula Search in zbMATH Open?

Authors: Philipp Scharpf, Moritz Schubotz, Andre Greiner-Petter, Malte Ostendorff, Olaf Teschke, Bela Gipp

Abstract: The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entries in our database, we continuously investigate new approaches to satisfy the information needs of our users. We believe that the findings from the ARQMath evaluation will generate new insights into which index structures are most suitable to satisfy mathematical information needs. Search engines, recommender systems, plagiarism checking software, and many other added-value services acting on databases such as the arXiv and zbMATH need to combine natural and formula language. One initial approach to address this challenge is to enrich the mostly unstructured document data via Entity Linking. The ARQMath Task at CLEF 2020 aims to tackle the problem of linking newly posted questions from Math Stack Exchange (MSE) to existing ones that were already answered by the community. To deeply understand MSE information needs, answer types, and formula types, we performed manual runs for tasks 1 and 2. Furthermore, for task 2, we explored several formula retrieval methods, such as fuzzy string search, k-nearest neighbors, and our recently introduced approach to retrieve Mathematical Objects of Interest (MOI) with textual search queries. The task results show that neither our automated methods nor our manual runs achieved good scores in the competition. However, the perceived quality of the hits returned by the MOI search particularly motivates us to conduct further research about MOI.

Submitted 10 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: in Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020 http://ceur-ws.org/Vol-2696/paper_200.pdf

arXiv:2005.12099 [pdf, other]

doi 10.1007/978-3-030-53518-6_15

AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels

Authors: Moritz Schubotz, Philipp Scharpf, Olaf Teschke, Andreas Kuehnemund, Corinna Breitinger, Bela Gipp

Abstract: Authors of research papers in the fields of mathematics and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital Libraries in Mathematics, as well as reviewing services such as zbMATH and Mathematical Reviews (MR), rely on these MSC labels in their workflows to organize the abstracting and reviewing process. In particular, the coarse-grained classification determines the subject editor who is responsible for the actual reviewing process. In this paper, we investigate the feasibility of automatically assigning a coarse-grained primary classification using the MSC scheme, by regarding the problem as a multi-class classification machine learning task. We find that our method achieves an $F_1$-score of over 77%, which is remarkably close to the agreement of zbMATH and MR ($F_1$-score of 81%). Moreover, we find that the method's confidence score allows for reducing the effort by 86% compared to the manual coarse-grained classification effort while maintaining a precision of 81% for automatically classified articles.

Submitted 9 November, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

Journal ref: Intelligent Computer Mathematics - 13th International Conference, CICM 2020, Bertinoro, Italy, July 26-31, 2020, Proceedings

arXiv:2003.09417 [pdf, other]

doi 10.1145/3383583.3398557

Mathematical Formulae in Wikimedia Projects 2020

Authors: Moritz Schubotz, André Greiner-Petter, Norman Meuschke, Olaf Teschke, Bela Gipp

Abstract: This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as coarse-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects further.

Submitted 6 May, 2020; v1 submitted 20 March, 2020; originally announced March 2020.

Comments: Submitted to JCDL 2020: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20), August 1-5, 2020, Virtual Event, China

arXiv:1905.03322 [pdf, other]

doi 10.1007/978-3-030-23250-4_18

Forms of Plagiarism in Digital Mathematical Libraries

Authors: Moritz Schubotz, Olaf Teschke, Vincent Stange, Norman Meuschke, Bela Gipp

Abstract: We report on an exploratory analysis of the forms of plagiarism observable in mathematical publications, which we identified by investigating editorial notes from zbMATH. While most cases we encountered were simple copies of earlier work, we also identified several forms of disguised plagiarism. We investigated 11 cases in detail and evaluated how current plagiarism detection systems perform in identifying these cases. Moreover, we describe the steps required to discover these and potentially undiscovered cases in the future.

Submitted 9 September, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

Journal ref: Intelligent Computer Mathematics - 12th International Conference, CICM 2019, Prague, Czech Republic, July 8-12, 2019, Proceedings

Showing 1–12 of 12 results for author: Teschke, O