MatText: Do Language Models Need More than Text & Scale for Materials Modeling?

Nawaf Alampara1  Santiago Miret2  Kevin Maik Jablonka1,3,4,*
1FSU Jena  2Intel Labs  3CEEC Jena  4HIPOLE Jena
[email protected]
[email protected]
[email protected]
* Corresponding author
1 Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstr. 10, 07743 Jena, Germany
3 Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Philosophenweg 7, 07743 Jena, Germany
4 Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstraße 12-14, 07743 Jena, Germany

1 MatText Dataset Description

1.1 Dataset metadata

The most relevant metadata about the MatText dataset is summarized in Table 1.

Table 1: Summary of the most relevant metadata for the MatText dataset.
Link to dataset: https://huggingface.co/datasets/n0w0f/MatText
Persistent identifier (DOI): 10.57967/hf/2363
Croissant metadata: https://huggingface.co/api/datasets/n0w0f/MatText/croissant
License: MIT. The authors bear all responsibility in case of violation of rights.

The data can be loaded with the HuggingFace datasets library, e.g.:

from datasets import load_dataset

dataset = load_dataset("n0w0f/MatText", "pretrain300k")

1.2 Datasheet for MatText dataset

This datasheet for the MatText dataset follows the "Datasheets for Datasets" template of Gebru et al. (2021).

1.2.1 Motivation

For what purpose was the dataset created?

The dataset was created to enable the training and benchmarking of text-based models of materials properties. Before this work, there was no systematic understanding of how different text representations perform on materials modeling tasks.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

Nawaf Alampara, Santiago Miret, and Kevin Maik Jablonka created the dataset. Nawaf Alampara is a Ph.D. student in the research group of Kevin Maik Jablonka at the Friedrich Schiller University Jena. Santiago Miret is employed by Intel Labs and is a scientific collaborator of the research group.

Who funded the creation of the dataset?

The research was funded by the Carl Zeiss Foundation as well as by Intel and Merck via the AWASES research center. Computations were supported by the Helmholtz Association’s Initiative and Networking Fund on the HAICORE@FZJ partition.

Any other comments?

n/a.

1.2.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

The instances represent materials. They are crystal structures of 3D-connected solid materials.

How many instances are there in total (of each type, if appropriate)?

In total, there are 2,012,082 instances in the pretraining dataset.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

The pretraining dataset is a subset of the materials deposited in the NOMAD archive. We queried only 3D-connected structures (i.e., excluding 2D materials, which often require special treatment) and, for consistency, limited our query to materials for which the geometry optimization was done using the PBE functional and the VASP code.

The benchmarking datasets are derived from MatBench. We limited ourselves to the smaller subsets for regression tasks, for which crystal structures are provided. Some instances are dropped because text representations could not be derived.

What data does each instance consist of?

Each instance comes with text representations (as strings). In addition, a unique ID links each instance to the original dataset. The fine-tuning datasets additionally provide a numeric label.

Is there a label or target associated with each instance?

There are labels for the fine-tuning datasets (gvrh-test-filtered, gvrh-train-filtered, kvrh-test-filtered, kvrh-train-filtered, perovskites-test-filtered, perovskites-train-filtered). For the gvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average shear moduli in GPa. For the kvrh datasets, the targets are the base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa. For the perovskite datasets, the labels are the heat of formation of the entire cell, in eV, as calculated by RPBE GGA-DFT.
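Because the gvrh and kvrh targets are stored as base 10 logarithms, a predicted label has to be exponentiated to recover a modulus in GPa (the perovskite labels are not log-transformed). The following is a minimal sketch of that conversion; the helper names `log10_gpa_to_gpa` and `gpa_to_log10_gpa` are hypothetical and not part of the MatText package, and the numeric value is illustrative rather than taken from the dataset.

```python
import math


def log10_gpa_to_gpa(log_modulus: float) -> float:
    """Invert the base 10 log transform used for the gvrh/kvrh labels."""
    return 10.0 ** log_modulus


def gpa_to_log10_gpa(modulus_gpa: float) -> float:
    """Forward transform; only defined for strictly positive moduli."""
    if modulus_gpa <= 0:
        raise ValueError("modulus must be positive to take its logarithm")
    return math.log10(modulus_gpa)


# Illustrative value only, not taken from the dataset.
example_label = 2.0
example_modulus = log10_gpa_to_gpa(example_label)  # modulus in GPa
```

Note that an error of 0.1 in log space corresponds to a multiplicative error of about 26 % in GPa, so errors reported on the logarithmic labels are not directly comparable to errors in physical units.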

Is any information missing from individual instances?

In some instances, some text representations (in particular Local-Env) are missing due to failed Voronoi tessellation, which is required to fragment crystal structures into local environments.
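One way to handle such gaps is to drop the instances that lack the representation needed for a given experiment. The sketch below operates on plain dictionaries; the field names `material_id` and `local_env`, and the use of `None` or an empty string to mark a missing representation, are assumptions made for illustration and may differ from the actual column names and encoding in the dataset.

```python
from typing import Iterable, List


def has_representation(instance: dict, key: str) -> bool:
    """Return True if the instance carries a non-empty string under `key`."""
    value = instance.get(key)
    return isinstance(value, str) and value.strip() != ""


def filter_by_representation(instances: Iterable[dict], key: str) -> List[dict]:
    """Keep only the instances for which the requested representation exists."""
    return [inst for inst in instances if has_representation(inst, key)]


# Toy records with hypothetical field names and placeholder strings.
toy_instances = [
    {"material_id": "a", "local_env": "placeholder representation"},
    {"material_id": "b", "local_env": None},  # representation missing
    {"material_id": "c", "local_env": "another placeholder"},
    {"material_id": "d"},                     # field absent altogether
]

kept = filter_by_representation(toy_instances, "local_env")
kept_ids = [inst["material_id"] for inst in kept]
```

Filtering per representation, rather than requiring all representations to be present, keeps as many instances as possible for each experiment, at the cost of slightly different instance sets across representations.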

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

There are no relevant known relationships between entities.

Are there recommended data splits (e.g., training, development/validation, testing)?

Yes, we follow the five-fold cross-validation proposed by MatBench for the benchmarking.
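For readers unfamiliar with the protocol, the sketch below shows how a five-fold split partitions instance indices so that every instance is used for testing exactly once. It is a generic, standard-library illustration: the function name `k_fold_indices` and the seed are hypothetical, and the folds it produces are not the official MatBench folds, which should be obtained from the MatBench package.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_instances: int, k: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all instances."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != test_fold
            for idx in fold
        )
        splits.append((train, test))
    return splits


splits = k_fold_indices(12, k=5, seed=0)
```

Reported scores are then typically the mean (and spread) of the metric over the five test folds.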

Are there any errors, sources of noise, or redundancies in the dataset?

To our knowledge, there are no redundancies. While we took care to avoid errors, there might be errors, for example, due to problems with the crystal structures in the raw data.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

The dataset is self-contained.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?

No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

No.

Does the dataset relate to people?

No.

Does the dataset identify any subpopulations (e.g., by age, gender)?

n/a

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?

n/a

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?

n/a

Any other comments?

n/a

1.2.3 Collection process

How was the data associated with each instance acquired?

The dataset is based on crystal structures reported in public databases. Scripts for constructing the dataset are provided on GitHub (https://github.com/lamalab-org/MatText/).

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?

We downloaded existing datasets (a subset of NOMAD and MatBench) and used our MatText framework to derive text representations.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

The dataset was retrieved via the NOMAD API based on criteria such as being the output of geometry optimization using VASP and the PBE functional. No other systematic filtering was performed.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

The authors were involved in the data collection process.

Over what timeframe was the data collected?

The data collection and preparation started in October 2023.

Were any ethical review processes conducted (e.g., by an institutional review board)?

No ethical review process was conducted since the data deals with inanimate materials.

Does the dataset relate to people?

No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

n/a

Were the individuals in question notified about the data collection?

n/a

Did the individuals in question consent to the collection and use of their data?

n/a

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?

n/a

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?

n/a

Any other comments?

n/a

1.2.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

Our main conversion was the addition of text representations.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?

The pretraining data is based on the MatSciML NOMAD dataset (https://zenodo.org/records/8381476). The raw MatBench data can be obtained via the MatBench package (https://matbench.materialsproject.org/).

Is the software used to preprocess/clean/label the instances available?

The scripts are provided on GitHub (https://github.com/lamalab-org/MatText/).

Any other comments?

n/a

1.2.5 Uses

Has the dataset been used for any tasks already?

The dataset has been used to compare text representations for materials property prediction tasks.

Is there a repository that links to any or all papers or systems that use the dataset?

We will collect papers on the GitHub repository (https://github.com/lamalab-org/MatText/).

What (other) tasks could the dataset be used for?

Besides property prediction, the dataset can be used to train models for inverse design (i.e., generate materials conditioned on a property) as well as recommender systems.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

To our knowledge, nothing about the composition might impact future uses.

Are there tasks for which the dataset should not be used?

The dataset is not suitable to benchmark the materials discovery ability of systems.

Any other comments?

n/a

Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes, the dataset is publicly available.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

The dataset is available via HuggingFace (https://huggingface.co/datasets/n0w0f/MatText/viewer). It is deposited under the DOI 10.57967/hf/2363.

When will the dataset be distributed?

The dataset is publicly available.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

The dataset is available under the MIT license.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

No.

Any other comments?

1.2.6 Maintenance

Who is supporting/hosting/maintaining the dataset?

The dataset is maintained by the research group of Kevin Maik Jablonka.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Kevin Maik Jablonka can be contacted via email ([email protected]). Nawaf Alampara can also be contacted via email ([email protected]).

Is there an erratum?

We will keep a changelog and highlight errors on the HuggingFace dataset card (https://huggingface.co/datasets/n0w0f/MatText).

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Updates will be performed and will be communicated via the HuggingFace dataset card (https://huggingface.co/datasets/n0w0f/MatText). We will properly version new major releases using git tags.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?

n/a

Will older versions of the dataset continue to be supported/hosted/maintained?

The dataset will be versioned using git, and old versions will remain available via HuggingFace.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

Contributions can be proposed via the discussions feature on HuggingFace or by raising an issue on GitHub (https://github.com/lamalab-org/MatText).

Any other comments?

n/a.