Search | arXiv e-print repository

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Authors: Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the profi… ▽ More Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our Case analysis indicates that while the GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{https://github.com/gipplab/LLM-Investig-MathStackExchange} △ Less

Submitted 30 March, 2024; originally announced April 2024.

Comments: Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) July 14--18, 2024, Washington D.C.,USA

arXiv:2403.07910 [pdf, other]

MAGPIE: Multi-Task Media-Bias Analysis Generalization for Pre-Trained Identification of Expressions

Authors: Tomáš Horych, Martin Wessel, Jan Philip Wahle, Terry Ruas, Jerome Waßmuth, André Greiner-Petter, Akiko Aizawa, Bela Gipp, Timo Spinde

Abstract: Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, the first large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable pre-training at scale, we present Large Bias Mixture (LBM), a compilation of… ▽ More Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, the first large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable pre-training at scale, we present Large Bias Mixture (LBM), a compilation of 59 bias-related tasks. MAGPIE outperforms previous approaches in media bias detection on the Bias Annotation By Experts (BABE) dataset, with a relative improvement of 3.3% F1-score. MAGPIE also performs better than previous models on 5 out of 8 tasks in the Media Bias Identification Benchmark (MBIB). Using a RoBERTa encoder, MAGPIE needs only 15% of finetuning steps compared to single-task approaches. Our evaluation shows, for instance, that tasks like sentiment and emotionality boost all learning, all tasks enhance fake news detection, and scaling tasks leads to the best results. MAGPIE confirms that MTL is a promising approach for addressing media bias detection, enhancing the accuracy and efficiency of existing models. Furthermore, LBM is the first available resource collection focused on media bias MTL. △ Less

Submitted 15 March, 2024; v1 submitted 26 February, 2024; originally announced March 2024.

arXiv:2402.17311 [pdf, other]

SKT5SciSumm -- A Hybrid Generative Approach for Multi-Document Scientific Summarization

Authors: Huy Quoc To, Hung-Nghiep Tran, Andr'e Greiner-Petter, Felix Beierle, Akiko Aizawa

Abstract: Summarization for scientific text has shown significant benefits both for the research community and human society. Given the fact that the nature of scientific text is distinctive and the input of the multi-document summarization task is substantially long, the task requires sufficient embedding generation and text truncation without losing important information. To tackle these issues, in this p… ▽ More Summarization for scientific text has shown significant benefits both for the research community and human society. Given the fact that the nature of scientific text is distinctive and the input of the multi-document summarization task is substantially long, the task requires sufficient embedding generation and text truncation without losing important information. To tackle these issues, in this paper, we propose SKT5SciSumm - a hybrid framework for multi-document scientific summarization (MDSS). We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences, allowing for efficient extractive summarization using k-means clustering. We employ the T5 family of models to generate abstractive summaries using extracted sentences. SKT5SciSumm achieves state-of-the-art performance on the Multi-XScience dataset. Through extensive experiments and evaluation, we showcase the benefits of our model by using less complicated models to achieve remarkable results, thereby highlighting its potential in advancing the field of multi-document summarization for scientific text. △ Less

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2401.16969 [pdf, ps, other]

doi 10.1007/978-3-031-56066-8_2

Taxonomy of Mathematical Plagiarism

Authors: Ankit Satpute, Andre Greiner-Petter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp

Abstract: Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating… ▽ More Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating potentially plagiarised 122 scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve an overall detection score (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. Outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism △ Less

Submitted 31 May, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: 46th European Conference on Information Retrieval (ECIR)

arXiv:2305.16433 [pdf, other]

Neural Machine Translation for Mathematical Formulae

Authors: Felix Petersen, Moritz Schubotz, Andre Greiner-Petter, Bela Gipp

Abstract: We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages. Compared to neural machine translation on natural language, mathematical formulae have a much smaller vocabulary and much longer sequences of symbols, while their translation requires extreme precision to satisfy mathematical information needs. In… ▽ More We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages. Compared to neural machine translation on natural language, mathematical formulae have a much smaller vocabulary and much longer sequences of symbols, while their translation requires extreme precision to satisfy mathematical information needs. In this work, we perform the tasks of translating from LaTeX to Mathematica as well as from LaTeX to semantic LaTeX. While recurrent, recursive, and transformer networks struggle with preserving all contained information, we find that convolutional sequence-to-sequence networks achieve 95.1% and 90.7% exact matches, respectively. △ Less

Submitted 25 May, 2023; originally announced May 2023.

Comments: Published at ACL 2023

arXiv:2305.13193 [pdf, other]

TEIMMA: The First Content Reuse Annotator for Text, Images, and Math

Authors: Ankit Satpute, André Greiner-Petter, Moritz Schubotz, Norman Meuschke, Akiko Aizawa, Olaf Teschke, Bela Gipp

Abstract: This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair -- TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the obfuscation type to enable novel classifications… ▽ More This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair -- TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the obfuscation type to enable novel classifications for confirmed cases of plagiarism. It enables recording different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math. △ Less

Submitted 13 June, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2305.07335 [pdf, other]

doi 10.5281/zenodo.7924634

Methods and Tools to Advance the Retrieval of Mathematical Knowledge from Digital Libraries for Search-, Recommendation-, and Assistance-Systems

Authors: Bela Gipp, André Greiner-Petter, Moritz Schubotz, Norman Meuschke

Abstract: This project investigated new approaches and technologies to enhance the accessibility of mathematical content and its semantic information for a broad range of information retrieval applications. To achieve this goal, the project addressed three main research challenges: (1) syntactic analysis of mathematical expressions, (2) semantic enrichment of mathematical expressions, and (3) evaluation usi… ▽ More This project investigated new approaches and technologies to enhance the accessibility of mathematical content and its semantic information for a broad range of information retrieval applications. To achieve this goal, the project addressed three main research challenges: (1) syntactic analysis of mathematical expressions, (2) semantic enrichment of mathematical expressions, and (3) evaluation using quality metrics and demonstrators. To make our research useful for the research community, we published tools that enable researchers to process mathematical expressions more effectively and efficiently. △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: The final report for the DFG-Project MathIR - July 1st, 2018 - December 31st, 2022

Report number: GI 1259-1 ACM Class: H.3.0

arXiv:2211.08361 [pdf, other]

doi 10.13140/RG.2.2.30988.18568

Collaborative and AI-aided Exam Question Generation using Wikidata in Education

Authors: Philipp Scharpf, Moritz Schubotz, Andreas Spitz, Andre Greiner-Petter, Bela Gipp

Abstract: Since the COVID-19 outbreak, the use of digital learning or education platforms has significantly increased. Teachers now digitally distribute homework and provide exercise questions. In both cases, teachers need to continuously develop novel and individual questions. This process can be very time-consuming and should be facilitated and accelerated both through exchange with other teachers and by… ▽ More Since the COVID-19 outbreak, the use of digital learning or education platforms has significantly increased. Teachers now digitally distribute homework and provide exercise questions. In both cases, teachers need to continuously develop novel and individual questions. This process can be very time-consuming and should be facilitated and accelerated both through exchange with other teachers and by using Artificial Intelligence (AI) capabilities. To address this need, we propose a multilingual Wikimedia framework that allows for collaborative worldwide teacher knowledge engineering and subsequent AI-aided question generation, test, and correction. As a proof of concept, we present >>PhysWikiQuiz<<, a physics question generation and test engine. Our system (hosted by Wikimedia at https://physwikiquiz.wmflabs.org) retrieves physics knowledge from the open community-curated database Wikidata. It can generate questions in different variations and verify answer values and units using a Computer Algebra System (CAS). We evaluate the performance on a public benchmark dataset at each stage of the system workflow. For an average formula with three variables, the system can generate and correct up to 300 questions for individual students based on a single formula concept name as input by the teacher. △ Less

Submitted 15 November, 2022; originally announced November 2022.

MSC Class: 68Uxx ACM Class: H.4

arXiv:2211.04049 [pdf, ps, other]

doi 10.3389/frma.2022.861944

Caching and Reproducibility: Making Data Science experiments faster and FAIRer

Authors: Moritz Schubotz, Ankit Satpute, Andre Greiner-Petter, Akiko Aizawa, Bela Gipp

Abstract: Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, o… ▽ More Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, suppose the ad-hoc research software fails during often long-running computationally expensive experiments. In that case, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for develo** research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as propriety dependence, speed, etc. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendation in a research software development will make the data related to that software FAIRer for both machines and humans. We exhibit the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval. △ Less

Submitted 9 November, 2022; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: 8 pages, 1 table

Journal ref: Frontiers in Research Metrics and Analytics, volume 7, 2022

arXiv:2201.09488 [pdf, other]

doi 10.1007/978-3-030-99524-9_5

Comparative Verification of the Digital Library of Mathematical Functions and Computer Algebra Systems

Authors: André Greiner-Petter, Howard S. Cohl, Abdou Youssef, Moritz Schubotz, Avi Trost, Rajen Dey, Akiko Aizawa, Bela Gipp

Abstract: Digital mathematical libraries assemble the knowledge of years of mathematical research. Numerous disciplines (e.g., physics, engineering, pure and applied mathematics) rely heavily on compendia gathered findings. Likewise, modern research applications rely more and more on computational solutions, which are often calculated and verified by computer algebra systems. Hence, the correctness, accurac… ▽ More Digital mathematical libraries assemble the knowledge of years of mathematical research. Numerous disciplines (e.g., physics, engineering, pure and applied mathematics) rely heavily on compendia gathered findings. Likewise, modern research applications rely more and more on computational solutions, which are often calculated and verified by computer algebra systems. Hence, the correctness, accuracy, and reliability of both digital mathematical libraries and computer algebra systems is a crucial attribute for modern research. In this paper, we present a novel approach to verify a digital mathematical library and two computer algebra systems with one another by converting mathematical expressions from one system to the other. We use our previously eveloped conversion tool (referred to as LaCASt) to translate formulae from the NIST Digital Library of Mathematical Functions to the computer algebra systems Maple and Mathematica. The contributions of our presented work are as follows: (1) we present the most comprehensive verification of computer algebra systems and digital mathematical libraries with one another; (2) we significantly enhance the performance of the underlying translator in terms of coverage and accuracy; and (3) we provide open access to translations for Maple and Mathematica of the formulae in the NIST Digital Library of Mathematical Functions. △ Less

Submitted 31 March, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Journal ref: In: TACAS, Apr. 2022, pp. 87-105

arXiv:2109.08899 [pdf, other]

doi 10.1007/978-3-319-96812-4_4

Automated Symbolic and Numerical Testing of DLMF Formulae using Computer Algebra Systems

Authors: Howard S. Cohl, André Greiner-Petter, Moritz Schubotz

Abstract: We have developed an automated procedure for symbolic and numerical testing of formulae extracted from the NIST Digital Library of Mathematical Functions (DLMF). For the NIST Digital Repository of Mathematical Formulae, we have developed conversion tools from semantic LaTeX to the Computer Algebra System (CAS) Maple which relies on Youssef's part-of-math tagger. We convert a test data subset of 4,… ▽ More We have developed an automated procedure for symbolic and numerical testing of formulae extracted from the NIST Digital Library of Mathematical Functions (DLMF). For the NIST Digital Repository of Mathematical Formulae, we have developed conversion tools from semantic LaTeX to the Computer Algebra System (CAS) Maple which relies on Youssef's part-of-math tagger. We convert a test data subset of 4,078 semantic LaTeX DLMF formulae %extracted from the DLMF to the native CAS representation and then apply an automated scheme for symbolic and numerical testing and verification. Our framework is implemented using Java and Maple. We describe in detail the conversion process which is required so that the CAS can correctly interpret the mathematical representation of the formulae. We describe the improvement of the effectiveness of our automated scheme through incremental enhancement (making more precise) of the mathematical semantic markup for the formulae. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: Appeared in the Proceedings of the 11th International Conference on Intelligent Computer Mathematics (CICM) 2018

arXiv:2109.08655 [pdf, other]

doi 10.1007/978-3-319-62075-6_9

Semantic Preserving Bijective Map**s of Mathematical Formulae between Document Preparation Systems and Computer Algebra Systems

Authors: Howard S. Cohl, Moritz Schubotz, Abdou Youssef, André Greiner-Petter, Jürgen Gerhard, Bonita V. Saunders, Marjorie A. ~McClain

Abstract: Document preparation systems like LaTeX offer the ability to render mathematical expressions as one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National Institute of Standards (NIST) Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup (semantic LaTeX) is achieved by using a semantic macro set. Computer algebra systems… ▽ More Document preparation systems like LaTeX offer the ability to render mathematical expressions as one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National Institute of Standards (NIST) Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup (semantic LaTeX) is achieved by using a semantic macro set. Computer algebra systems (CAS) such as Maple and Mathematica use alternative markup to represent mathematical expressions. By taking advantage of Youssef's Part-of-Math tagger and CAS internal representations, we develop algorithms to translate mathematical expressions represented in semantic LaTeX to corresponding CAS representations and vice versa. We have also developed tools for translating the entire Wolfram Encoding Continued Fraction Knowledge and University of Antwerp Continued Fractions for Special Functions datasets, for use in the NIST Digital Repository of Mathematical Formulae. The overall goal of these efforts is to provide semantically enriched standard conforming MathML representations to the public for formulae in digital mathematics libraries. These representations include presentation MathML, content MathML, generic LaTeX, semantic LaTeX, and now CAS representations as well. △ Less

Submitted 17 September, 2021; originally announced September 2021.

Comments: Proceedings of the 10th International Conference on Intelligent Computer Mathematics (CICM)

arXiv:2109.08539 [pdf, other]

doi 10.1007/978-3-319-96812-4_9

MathTools: An Open API for Convenient MathML Handling

Authors: André Greiner-Petter, Moritz Schubotz, Howard S. Cohl, Bela Gipp

Abstract: Mathematical formulae carry complex and essential semantic information in a variety of formats. Accessing this information with different systems requires a standardized machine-readable format that is capable of encoding presentational and semantic information. Even though MathML is an official recommendation by W3C and an ISO standard for representing mathematical expressions, we could identify… ▽ More Mathematical formulae carry complex and essential semantic information in a variety of formats. Accessing this information with different systems requires a standardized machine-readable format that is capable of encoding presentational and semantic information. Even though MathML is an official recommendation by W3C and an ISO standard for representing mathematical expressions, we could identify only very few systems which use the full descriptiveness of MathML. MathML's high complexity results in a steep learning curve for novice users. We hypothesize that this complexity is the reason why many community-driven projects refrain from using MathML, and instead develop problem-specific data formats for their purposes. We provide a user-friendly, open-source application programming interface for controlling MathML data. Our API is written in JAVA and allows to create, manipulate, and efficiently access commonly needed information in presentation and content MathML. Our interface also provides tools for calculating differences and similarities between MathML expressions. The API also allows to determine the distance between expressions using different similarity measures. In addition, we provide adapters for numerous conversion tools and the canonicalization project. Our toolkit facilitates processing of mathematics for digital libraries, without the need to obtain XML expertise. △ Less

Submitted 17 September, 2021; originally announced September 2021.

Comments: Published in Proceedings of the International Conference on Intelligent Computer Mathematics (CICM) 2018

arXiv:2012.02413 [pdf]

ARQMath Lab: An Incubator for Semantic Formula Search in zbMATH Open?

Authors: Philipp Scharpf, Moritz Schubotz, Andre Greiner-Petter, Malte Ostendorff, Olaf Teschke, Bela Gipp

Abstract: The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entries in our database, we continuously investigate new approaches to satisfy the information needs of our users. We believe that the findings from the ARQMath evalua… ▽ More The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entries in our database, we continuously investigate new approaches to satisfy the information needs of our users. We believe that the findings from the ARQMath evaluation will generate new insights into which index structures are most suitable to satisfy mathematical information needs. Search engines, recommender systems, plagiarism checking software, and many other added-value services acting on databases such as the arXiv and zbMATH need to combine natural and formula language. One initial approach to address this challenge is to enrich the mostly unstructured document data via Entity Linking. The ARQMath Task at CLEF 2020 aims to tackle the problem of linking newly posted questions from Math Stack Exchange (MSE) to existing ones that were already answered by the community. To deeply understand MSE information needs, answer-, and formula types, we performed manual runs for tasks 1 and 2. Furthermore, we explored several formula retrieval methods: For task 2, such as fuzzy string search, k-nearest neighbors, and our recently introduced approach to retrieve Mathematical Objects of Interest (MOI) with textual search queries. The task results show that neither our automated methods nor our manual runs archived good scores in the competition. However, the perceived quality of the hits returned by the MOI search particularly motivates us to conduct further research about MOI. △ Less

Submitted 10 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: in Working Notes of {CLEF} 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020 http://ceur-ws.org/Vol-2696/paper_200.pdf

arXiv:2011.14616 [pdf, other]

Automatic Mathematical Information Retrieval to Perform Translations up to Computer Algebra Systems

Authors: André Greiner-Petter

Abstract: In mathematics, LaTeX is the de facto standard to prepare documents, e.g., scientific publications. While some formulae are still developed using pen and paper, more complicated mathematical expressions used more and more often with computer algebra systems. Mathematical expressions are often manually transcribed to computer algebra systems. The goal of my doctoral thesis is to improve the efficie… ▽ More In mathematics, LaTeX is the de facto standard to prepare documents, e.g., scientific publications. While some formulae are still developed using pen and paper, more complicated mathematical expressions used more and more often with computer algebra systems. Mathematical expressions are often manually transcribed to computer algebra systems. The goal of my doctoral thesis is to improve the efficiency of this workflow. My envisioned method will automatically semantically enrich mathematical expressions so that they can be imported to computer algebra systems and other systems that can take advantage of the semantics, such as search engines or automatic plagiarism detection systems. These imports should preserve the essential semantic features of the expression. △ Less

Submitted 30 November, 2020; originally announced November 2020.

Comments: Doctoral Consortium Paper at the Joint Conference on Digital Libraries (JCDL), Fort Worth, TX, USA, June 03-07, 2018

Journal ref: Bulletin of IEEE Technical Committee on Digital Libraries 15.1 (Jan. 2019)

arXiv:2003.09417 [pdf, other]

doi 10.1145/3383583.3398557

Mathematical Formulae in Wikimedia Projects 2020

Authors: Moritz Schubotz, André Greiner-Petter, Norman Meuschke, Olaf Teschke, Bela Gipp

Abstract: This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as course-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve the accessibility and discoverability of mathematica… ▽ More This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as course-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects further. △ Less

Submitted 6 May, 2020; v1 submitted 20 March, 2020; originally announced March 2020.

Comments: Submitted to JCDL 2020: Proceedings of the ACM/ IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20), August 1-5, 2020, Virtual Event, China

arXiv:2002.02712 [pdf, other]

doi 10.1145/3366423.3380218

Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations

Authors: Andre Greiner-Petter, Moritz Schubotz, Fabian Mueller, Corinna Breitinger, Howard S. Cohl, Akiko Aizawa, Bela Gipp

Abstract: Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access… ▽ More Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases. For example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_{n}^{(α, β)}\!\left(x\right)$ with `Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data. △ Less

Submitted 22 June, 2021; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: Proceedings of The Web Conference 2020 (WWW'20), April 20--24, 2020, Taipei, Taiwan

arXiv:1906.11485 [pdf, other]

doi 10.1108/AJIM-08-2018-0185

Semantic Preserving Bijective Map**s for Expressions involving Special Functions in Computer Algebra Systems and Document Preparation Systems

Authors: Andre Greiner-Petter, Moritz Schubotz, Howard S. Cohl, Bela Gipp

Abstract: Purpose: Modern mathematicians and scientists of math-related disciplines often use Document Preparation Systems (DPS) to write and Computer Algebra Systems (CAS) to calculate mathematical expressions. Usually, they translate the expressions manually between DPS and CAS. This process is time-consuming and error-prone. Our goal is to automate this translation. This paper uses Maple and Mathematica… ▽ More Purpose: Modern mathematicians and scientists of math-related disciplines often use Document Preparation Systems (DPS) to write and Computer Algebra Systems (CAS) to calculate mathematical expressions. Usually, they translate the expressions manually between DPS and CAS. This process is time-consuming and error-prone. Our goal is to automate this translation. This paper uses Maple and Mathematica as the CAS, and LaTeX as our DPS. Design/methodology/approach: Bruce Miller at the National Institute of Standards and Technology (NIST) developed a collection of special LaTeX macros that create links from mathematical symbols to their definitions in the NIST Digital Library of Mathematical Functions (DLMF). We are using these macros to perform rule-based translations between the formulae in the DLMF and CAS. Moreover, we develop software to ease the creation of new rules and to discover inconsistencies. Findings: We created 396 map**s and translated 58.8% of DLMF formulae (2,405 expressions) successfully between Maple and DLMF. For a significant percentage, the special function definitions in Maple and the DLMF were different. Therefore, an atomic symbol in one system maps to a composite expression in the other system. The translator was also successfully used for automatic verification of mathematical online compendia and CAS. Our evaluation techniques discovered two errors in the DLMF and one defect in Maple. Originality: This paper introduces the first translation tool for special functions between LaTeX and CAS. The approach improves error-prone manual translations and can be used to verify mathematical online compendia and CAS. △ Less

Submitted 27 June, 2019; originally announced June 2019.

Comments: This work was supported by the German Research Foundation (DFG, grant GI-1259-1)

arXiv:1905.08359 [pdf, other]

Why Machines Cannot Learn Mathematics, Yet

Authors: André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William Grosky, Bela Gipp

Abstract: Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communica… ▽ More Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we also investigate the missing aspects that would allow mathematics to be learned by computers. △ Less

Submitted 20 May, 2019; originally announced May 2019.

Comments: Submitted to 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries colocated at the 42nd International ACM SIGIR Conference

Journal ref: 2019 http://ceur-ws.org/Vol-2414/paper14.pdf

arXiv:1804.04956 [pdf, other]

doi 10.1145/3197026.3197058

Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

Authors: Moritz Schubotz, Andre Greiner-Petter, Philipp Scharpf, Norman Meuschke, Howard Cohl, Bela Gipp

Abstract: Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable f… ▽ More Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism. △ Less

Submitted 13 April, 2018; originally announced April 2018.

Comments: 10 pages, 4 figures

Journal ref: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Jun. 2018, Fort Worth, USA

Showing 1–20 of 20 results for author: Greiner-Petter, A