-
Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction
Authors:
Kausik Hira,
Mohd Zaki,
Dhruvil Sheth,
Mausam,
N. M. Anoop Krishnan
Abstract:
The discovery of new materials has a documented history of propelling human progress for centuries and more. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, giving rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
Submitted 26 April, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models
Authors:
Mohd Zaki,
Jayadeva,
Mausam,
N. M. Anoop Krishnan
Abstract:
Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can evaluate the understanding of the key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student who has cleared their undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions via zero-shot and chain-of-thought prompting. It is observed that GPT-4 gives the best performance (~62% accuracy) as compared to GPT-3.5. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with chain-of-thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (~64%) as the major contributor compared to computational errors (~36%) towards the reduced performance of LLMs. We hope that the dataset and analysis performed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
Submitted 17 August, 2023;
originally announced August 2023.
-
Energy Transformer
Authors:
Benjamin Hoover,
Yuchen Liang,
Bao Pham,
Rameswar Panda,
Hendrik Strobelt,
Duen Horng Chau,
Mohammed J. Zaki,
Dmitry Krotov
Abstract:
Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory. Attention is the power-house driving modern deep learning successes, but it lacks clear theoretical foundations. Energy-based models allow a principled approach to discriminative and generative tasks, but the design of the energy functional is not straightforward. At the same time, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, and allow an intuitive design of the energy function. We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection and graph classification tasks.
Submitted 31 October, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Glass Hardness: Predicting Composition and Load Effects via Symbolic Reasoning-Informed Machine Learning
Authors:
Sajid Mannan,
Mohd Zaki,
Suresh Bishnoi,
Daniel R. Cassar,
Jeanini Jiusti,
Julio Cesar Ferreira Faria,
Johan F. S. Christensen,
Nitya Nand Gosvami,
Morten M. Smedskjaer,
Edgar Dutra Zanotto,
N. M. Anoop Krishnan
Abstract:
Glass hardness varies in a non-linear fashion with the chemical composition and applied load, a phenomenon known as the indentation size effect (ISE), which is challenging to predict quantitatively. Here, using a curated dataset of approximately 3000 inorganic glasses from the literature comprising the composition, indentation load, and hardness, we develop machine learning (ML) models to predict the composition and load dependence of Vickers hardness. Interestingly, when tested on new glass compositions unseen during the training, the standard data-driven ML model failed to capture the ISE. To address this gap, we combined an empirical expression (Bernhardt law) to describe the ISE with ML to develop a framework that incorporates the symbolic law representing the domain reasoning in ML, namely Symbolic Reasoning-Informed ML Procedure (SRIMP). We show that the resulting SRIMP outperforms the data-driven ML model in predicting the ISE. Finally, we interpret the SRIMP model to understand the contribution of the glass network formers and modifiers toward composition and load-dependent (ISE) and load-independent hardness. The deconvolution of the hardness into load-dependent and load-independent terms paves the way toward a holistic understanding of composition and ISE in glasses, enabling the accelerated discovery of new glass compositions with targeted hardness.
Submitted 19 January, 2023;
originally announced January 2023.
-
Cementron: Machine Learning the Constituent Phases in Cement Clinker from Optical Images
Authors:
Mohd Zaki,
Siddhant Sharma,
Sunil Kumar Gurjar,
Raju Goyal,
Jayadeva,
N. M. Anoop Krishnan
Abstract:
Cement is the most used construction material. The performance of cement hydrate depends on the constituent phases, viz. alite, belite, aluminate, and ferrites present in the cement clinker, both qualitatively and quantitatively. Traditionally, clinker phases are analyzed from optical images relying on a domain expert and simple image processing techniques. However, the non-uniformity of the images, variations in the geometry and size of the phases, and variabilities in the experimental approaches and imaging methods make it challenging to obtain the phases. Here, we present a machine learning (ML) approach to detect clinker microstructure phases automatically. To this end, we create the first annotated dataset of cement clinker by segmenting alite and belite particles. Further, we use supervised ML methods to train models for identifying alite and belite regions. Specifically, we finetune the image detection and segmentation model Detectron-2 on the cement microstructure to develop a model for detecting the cement phases, namely, Cementron. We demonstrate that Cementron, trained only on literature data, works remarkably well on new images obtained from our experiments, demonstrating its generalizability. We make Cementron available for public use.
Submitted 6 November, 2022;
originally announced November 2022.
-
Accelerated Design of Chalcogenide Glasses through Interpretable Machine Learning for Composition Property Relationships
Authors:
Sayam Singla,
Sajid Mannan,
Mohd Zaki,
N. M. Anoop Krishnan
Abstract:
Chalcogenide glasses possess several outstanding properties that enable several groundbreaking applications, such as optical discs, infrared cameras, and thermal imaging systems. Despite the ubiquitous usage of these glasses, the composition-property relationships in these materials remain poorly understood. Here, we use a large experimental dataset comprising approximately 24000 glass compositions made of 51 distinct elements from the periodic table to develop machine learning models for predicting 12 properties, namely, annealing point, bulk modulus, density, Vickers hardness, Littleton point, Young's modulus, shear modulus, softening point, thermal expansion coefficient, glass transition temperature, liquidus temperature, and refractive index. These models, by far, are the largest for chalcogenide glasses. Further, we use SHAP, a game-theory-based algorithm, to interpret the output of machine learning algorithms by analyzing the contributions of each element towards the model's prediction of a property. This provides a powerful tool for experimentalists to interpret the model's prediction and hence design new glass compositions with targeted properties. Finally, using the models, we develop several glass selection charts that can potentially aid in the rational design of novel chalcogenide glasses for various applications.
Submitted 1 November, 2022;
originally announced November 2022.
-
DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles
Authors:
Tanishq Gupta,
Mohd Zaki,
Devanshi Khatsuriya,
Kausik Hira,
N. M. Anoop Krishnan,
Mausam
Abstract:
A crucial component in the curation of KB for a scientific domain (e.g., materials science, foods & nutrition, fuels) is information extraction from tables in the domain's published research articles. To facilitate research in this direction, we define a novel NLP task of extracting compositions of materials (e.g., glasses) from tables in materials science papers. The task involves solving several challenges in concert: tables that mention compositions have highly varying structures; text in captions and full paper needs to be incorporated along with data in tables; and regular languages for numbers, chemical compounds, and composition expressions must be integrated into the model. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present a strong baseline, DISCOMAT, that combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DISCOMAT outperforms recent table processing architectures by significant margins.
Submitted 28 January, 2024; v1 submitted 3 July, 2022;
originally announced July 2022.
-
MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction
Authors:
Tanishq Gupta,
Mohd Zaki,
N. M. Anoop Krishnan,
Mausam
Abstract:
An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on science corpus, on all the tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.
Submitted 30 September, 2021;
originally announced September 2021.
-
Revealing the Compositional Control of Electrical, Mechanical, Optical, and Physical Properties of Inorganic Glasses
Authors:
R. Ravinder,
Suresh Bishnoi,
Mohd Zaki,
N. M. Anoop Krishnan
Abstract:
Inorganic glasses, produced by the melt-quenching of a concoction of minerals, compounds, and elements, can possess unique optical and elastic properties along with excellent chemical and thermal durability. Despite the ubiquitous use of glasses for critical applications such as touchscreen panels, windshields, bioactive implants, optical fibers and sensors, kitchen and laboratory glassware, thermal insulators, nuclear waste immobilization, optical lenses, and solid electrolytes, their composition-structure-property relationships remain poorly understood. Here, exploiting large-scale experimental data on inorganic glasses and explainable machine learning algorithms, we develop composition-property models for twenty-five properties, which are in agreement with experimental observations. These models are further interpreted using a game-theoretic concept, namely Shapley additive explanations, to understand the role of glass components in controlling the final property. The analysis reveals that the components present in the glass, such as network formers, modifiers, and the intermediates, play distinct roles in governing each of the optical, physical, electrical, and mechanical properties of glasses. Additionally, these components exhibit interdependence, the magnitude of which is different for different properties. While the physical origins of some of these interdependencies could be attributed to known phenomena such as "boron anomaly", "mixed modifier effect", and the "Loewenstein rule", the majority of the remaining ones require further experimental and computational analysis of the glass structure. Thus, our work paves the way for decoding the "glass genome", which can provide the recipe for discovery of novel glasses, while also shedding light on the fundamental factors governing the composition-structure-property relationships.
Submitted 23 March, 2021; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Unveiling the Glass Veil: Elucidating the Optical Properties in Glasses with Interpretable Machine Learning
Authors:
Mohd Zaki,
Vineeth Venugopal,
R. Ravinder,
Suresh Bishnoi,
Sourabh Kumar Singh,
Amarnath R. Allu,
Jayadeva,
N. M. Anoop Krishnan
Abstract:
Due to their excellent optical properties, glasses are used for various applications ranging from smartphone screens to telescopes. Developing compositions with tailored Abbe number (Vd) and refractive index (nd), two crucial optical properties, is a major challenge. To this end, machine learning (ML) approaches have been successfully used to develop composition-property models. However, these models are essentially black-box in nature and suffer from the lack of interpretability. In this paper, we demonstrate the use of ML models to predict the composition-dependent variations of Vd and n at 587.6 nm (nd). Further, using SHapley Additive exPlanations (SHAP), we interpret the ML models to identify the contribution of each of the input components toward a target prediction. We observe that the glass formers such as SiO2, B2O3, and P2O5, and intermediates like TiO2, PbO, and Bi2O3 play a significant role in controlling the optical properties. Interestingly, components that contribute toward increasing the nd are found to decrease the Vd and vice-versa. Finally, we develop the Abbe diagram, also known as the "glass veil", using the ML models, allowing accelerated discovery of new glasses for optical properties beyond the experimental Pareto front. Overall, employing explainable ML, we discover the hidden compositional control on the optical properties of oxide glasses.
Submitted 5 March, 2021;
originally announced March 2021.