Journal:Shared metadata for data-centric materials science

Full article title	Shared metadata for data-centric materials science
Journal	Scientific Data
Author(s)	Ghiringhelli, Luca M.; Baldauf, Carsten; Bereau, Tristan; Brockhauser, Sandor; Carbogno, Christian; Chamanara, Javad; Cozzini, Stefano; Curtarolo, Stefano; Draxl, Claudia; Dwaraknath, Shyam; Fekete, Ádám; Kermode, James; Koch, Christoph T.; Kühbach, Markus; Ladines, Alvin Noe; Lambrix, Patrick; Himmer, Maja-Olivia; Levchenko, Sergey V.; Oliveira, Micael; Michalchuk, Adam; Miller, Ronald E.; Onat, Berk; Pavone, Pasquale; Pizzi, Giovanni; Regler, Benjamin; Rignanese, Gian-Marco; Schaarschmidt, Jörg; Scheidgen, Markus; Schneidewind, Astrid; Sheveleva, Tatyana; Su, Chuanxun; Usvyat, Denis; Valsson, Omar; Wöll, Christof; Scheffler, Matthias
Author affiliation(s)	Friedrich-Alexander Universität, Humboldt-Universität zu Berlin, Fritz-Haber-Institut of the Max-Planck-Gesellschaft, University of Amsterdam, TIB – Leibniz Information Centre for Science and Technology and University Library, AREA Science Park, Duke University, Lawrence Berkeley National Laboratory, University of Warwick, Linköping University, Skolkovo Institute of Science and Technology, Max Planck Institute for the Structure and Dynamics of Matter, Federal Institute for Materials Research and Testing, University of Birmingham, Carleton University, École Polytechnique Fédérale de Lausanne, Paul Scherrer Institut, Chemin des Étoiles, Karlsruhe Institute of Technology, Forschungszentrum Jülich GmbH, University of Science and Technology of China, University of North Texas
Primary contact	Email: luca dot ghiringhelli at physik dot hu dash berlin dot de
Year published	2023
Volume and issue	10
Article #	626
DOI	10.1038/s41597-023-02501-8
ISSN	2052-4463
Distribution license	Creative Commons Attribution 4.0 International
Website	https://www.nature.com/articles/s41597-023-02501-8
Download	https://www.nature.com/articles/s41597-023-02501-8.pdf (PDF)


Abstract

The expansive production of data in materials science, as well as their widespread sharing and repurposing, requires educated support and stewardship. In order to ensure that this need helps rather than hinders scientific work, the implementation of the FAIR data principles (that ask for data and information to be findable, accessible, interoperable, and reusable) must not be too narrow. At the same time, the wider materials science community ought to agree on the strategies to tackle the challenges that are specific to its data, both from computations and experiments. In this paper, we present the result of the discussions held at the workshop on “Shared Metadata and Data Formats for Big-Data Driven Materials Science.” We start from an operative definition of metadata and the features that a FAIR-compliant metadata schema should have. We will mainly focus on computational materials science data and propose a constructive approach for the "FAIR-ification" of the (meta)data related to ground-state and excited-states calculations, potential energy sampling, and generalized workflows. Finally, challenges with the FAIR-ification of experimental (meta)data and materials science ontologies are presented, together with an outlook of how to meet them.

Keywords: materials science, data sharing, FAIR data principles, file formats, metadata, ontologies

Introduction: Metadata and FAIR data principles

The amount of data that has been produced in materials science up to today, and its day-by-day increase, are massive. [1] The dawn of the data-centric era [2] requires that such data are not just stored, but also carefully annotated in order to find, access, and possibly reuse them. Terms of good practice to be adopted by the scientific community for the management and stewardship of its data, the so-called FAIR data principles, have been compiled by the FORCE11 group. [3] Here, the acronym "FAIR" stands for "findable, accessible, interoperable, and reusable," which applies not only to data but also to metadata. Other terms for the “R” in FAIR are “repurposable” and “recyclable.” The former term indicates that data may be used for a different purpose than the original one for which they were created. The latter term hints at the fact that data in materials science are often exploited only once for supporting the thesis of a single publication, and then they are stored and forgotten. In this sense, they would constitute a “waste” that can be recycled, provided that they can be found and they are properly annotated.

Before examining the meaning and importance of the four terms of the FAIR acronym, it is worth defining what metadata are with respect to data. To that purpose, we start by introducing the concept of a data object, which represents the collective storage of information related to an elementary entry in a database. One can consider it as a row in a table, where the columns can be occupied by simple scalars, higher-order mathematical objects, strings of characters, or even full documents (or other media objects). In the materials science context, a data object is the collection of attributes (the columns in the above-mentioned table) that represent a material or, even more fundamentally, a snapshot of the material captured by a single configuration of atoms, or it may be a set of measurements from well-defined equivalent samples (see below for a discussion on this concept). For instance, in computational materials science, the attributes of a data object could be both the inputs (e.g., the coordinates and chemical species of the atoms constituting the material, the description of the physical model used for calculating its properties) and the outputs (e.g., total energy, forces, electronic density of states) of a calculation. Logically and physically, inputs and outputs are at different levels, in the sense that the former determine the latter. Hence, one can consider the inputs as metadata describing the data, i.e., the outputs. In turn, the set of coordinates A that is metadata to some observed quantities may itself be considered as data that depend on another set of coordinates B and the forces acting on the atoms in that set A. So, the set of coordinates B and the acting forces are metadata to the set A, now regarded as data. Metadata can always be considered to be data, since they may themselves be the object of analyses that are different from, and independent of, those performed on the calculated properties. In this respect, whether an attribute of a data object is data or metadata depends on the context. This simple example also depicts a provenance relationship between the data and their metadata.
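
The context dependence described above can be illustrated with a minimal Python sketch. It assumes a data object is a flat mapping of named attributes; the class name DataObject, the attribute names, and all numerical values are hypothetical placeholders chosen for illustration, not part of any schema discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple


@dataclass
class DataObject:
    """One database entry: a flat collection of named attributes (hypothetical)."""
    attributes: Dict[str, Any] = field(default_factory=dict)

    def split(self, data_keys: Set[str]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """Return (metadata, data) for one choice of which attributes count as data."""
        data = {k: v for k, v in self.attributes.items() if k in data_keys}
        metadata = {k: v for k, v in self.attributes.items() if k not in data_keys}
        return metadata, data


# Placeholder values, for illustration only.
entry = DataObject({
    "species": ["H", "H"],
    "coordinates_angstrom": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    "xc_functional": "PBE",
    "total_energy_eV": -1.0,
    "forces_eV_per_angstrom": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
})

# Context 1: the computed properties are the data; the inputs describe them.
meta_1, data_1 = entry.split({"total_energy_eV", "forces_eV_per_angstrom"})

# Context 2: the coordinates are the data (e.g., in a structural analysis).
meta_2, data_2 = entry.split({"coordinates_angstrom"})
```

The same attribute, here the coordinates, ends up as metadata in the first context and as data in the second, without any change to the stored entry.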

The above discussion can be summarized in a more general definition of the term metadata:

Metadata are attributes that are necessary to locate, fully characterize, and ultimately reproduce other attributes that are identified as data.

The metadata include a clear and unambiguous description of the data as well as their full provenance. This definition is reminiscent of the definition given by the National Institute of Standards and Technology (NIST) [4]: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about information or information about information.” With our definition, we highlight the role of data “reproducibility,” which is crucial in science.

Within the “full characterization” requirement, we highlight the interpretation of the data as a crucial aspect. In other words, the metadata must provide enough information on a stored value (including, e.g., dimensionless constants) to make it unambiguous whether or not two data objects may be compared with respect to the value of a given attribute.
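
As a rough illustration of this comparability requirement, the following Python sketch treats two stored values as comparable only if the metadata that fix their interpretation agree. The class AnnotatedValue, the function comparable, and the choice of quantity, unit, and method as the deciding fields are assumptions made for this example; a real schema would have to define which metadata determine comparability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedValue:
    """A stored number together with the metadata needed to interpret it."""
    quantity: str   # what the number represents
    value: float
    unit: str       # unit in which the value is stored
    method: str     # physical model used to obtain it


def comparable(a: AnnotatedValue, b: AnnotatedValue,
               keys=("quantity", "unit", "method")) -> bool:
    """True if all interpretation-relevant metadata of a and b coincide."""
    return all(getattr(a, k) == getattr(b, k) for k in keys)


# Placeholder values, for illustration only.
e1 = AnnotatedValue("total_energy", -1.0, "eV", "PBE")
e2 = AnnotatedValue("total_energy", -1.1, "eV", "PBE")
e3 = AnnotatedValue("total_energy", -0.04, "hartree", "PBE")

same_interpretation = comparable(e1, e2)       # units and method agree
different_interpretation = comparable(e1, e3)  # units differ
```

A stricter variant could convert units before comparing instead of rejecting the pair; the sketch deliberately refuses to guess.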

Next, we should notice that, whereas in computational materials science the concept of a data object identified by a single atomic configuration is well defined, in experimental materials science the concept of a class of equivalent samples is very hard to implement operationally. For instance, a single specimen can be altered by a measurement operation and thus cannot, strictly speaking, be measured twice. At the same time, two specimens prepared with the same synthesis recipe may differ in substantial aspects due to the presence of different impurities or even crystal phases, thus yielding different values of a measured quantity. In this respect, here we use the term "equivalent sample" in its abstract, ideal meaning, but we also mention that one of the main purposes of introducing well-defined metadata in materials science is to provide enough characterization of experimental samples to put into practice the concept of equivalent samples.

The need for storing and characterizing data by means of metadata is determined by two main aspects, related to data usage. The first aspect is as old as science: reproducibility. In an experiment or computation, all the information needed to reproduce the measured/calculated data (i.e., the metadata) should be recorded, stored, and retrievable. The second aspect becomes prominent with the demand for reusability. Data can and should also be usable for purposes that were not anticipated at the time they were recorded. A useful way of looking at metadata is that they are attributes of data objects answering the questions who, what, when, where, why, and how. For example, “Who has produced the data?”, “What are the data expected to represent (in physical terms)?”, “When were they produced?”, “Where are they stored?”, “For what purpose were they produced?”, and “By means of which methods were the data obtained?”. The latter two questions also refer to the concept of provenance, i.e., the logical sequence of operations that determine, ideally univocally, the data. Keeping track of the provenance requires the possibility to record the whole workflow that has led to some calculated or measured properties (for more details, see the later section “Metadata for computational workflows”).
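
The six questions and the notion of provenance can be sketched in a few lines of Python. The example assumes each record stores the answers to who, what, when, where, why, and how, plus an optional link to the record it was derived from; the class Record, the function lineage, and all field values are hypothetical and serve only to show how a chain of operations can be traced back.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str
    derived_from: Optional[str] = None  # key of the record this one depends on


def lineage(key: str, records: Dict[str, Record]) -> List[str]:
    """Follow derived_from links and return the chain, oldest record first."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = key
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic provenance at {current!r}")
        seen.add(current)
        chain.append(current)
        current = records[current].derived_from
    return list(reversed(chain))


# Placeholder entries, for illustration only.
records = {
    "relaxation": Record("A. Researcher", "relaxed geometry", "2019-07-01",
                         "archive/relaxation", "find equilibrium structure",
                         "geometry optimization"),
    "scf": Record("A. Researcher", "total energy", "2019-07-02",
                  "archive/scf", "reference energy",
                  "self-consistent calculation", derived_from="relaxation"),
    "bands": Record("A. Researcher", "band structure", "2019-07-03",
                    "archive/bands", "electronic properties",
                    "non-self-consistent calculation", derived_from="scf"),
}

band_structure_lineage = lineage("bands", records)
```

A single derived_from link only captures linear chains; a general workflow, where one step may depend on several earlier ones, would need a graph rather than a list.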

From a practical point of view, the metadata are organized in a schema. We summarize what the FAIR principles imply in terms of a metadata schema as follows:

Findability is achieved by assigning unique and persistent identifiers (PIDs) to data and metadata, describing data with rich metadata, and registering (see below) the (meta)data in searchable resources. Widely known examples of PIDs are digital object identifiers (DOIs) and (permanent) Uniform Resource Identifiers (URIs). According to ISO/IEC 11179, a metadata registry (MDR) is a database of metadata that supports the functionality of registration. Registration accomplishes three main goals: identification, provenance, and monitoring quality. Furthermore, an MDR manages the semantics of the metadata, i.e., the relationships (connections) among them.
Accessibility is enabled by application programming interfaces (APIs), which allow one to query and retrieve single entries as well as entire archives.
Interoperability implies the use of formal, accessible, shared, and broadly applicable languages for knowledge representation (these are known as formal ontologies and will be discussed in the later section “Outlook on ontologies in materials science”), use of vocabularies to annotate data and metadata, and inclusion of references.
Reusability hints at the fact that data in materials science are often exploited only once for a focus-oriented research project, and many data are not even properly stored as they turned out to be irrelevant for the focus. In this sense, many data constitute a “waste” that can be recycled, provided that the data can be found and they are properly annotated.
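
To make the first two principles more concrete, the following Python sketch shows a toy in-memory registry that assigns an identifier on registration and exposes a small query interface. It is an illustration under strong assumptions: the class MetadataRegistry and its methods are hypothetical, a random UUID merely stands in for a persistent identifier such as a DOI, and none of the semantics or quality monitoring of an ISO/IEC 11179 metadata registry are modeled.

```python
import uuid
from typing import Any, Dict, List


class MetadataRegistry:
    """Toy registry: identifier assignment plus attribute-based search."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def register(self, metadata: Dict[str, Any]) -> str:
        """Store a copy of the metadata and return its (stand-in) identifier."""
        pid = f"urn:uuid:{uuid.uuid4()}"
        self._entries[pid] = dict(metadata)
        return pid

    def get(self, pid: str) -> Dict[str, Any]:
        """Retrieve a single entry by identifier."""
        return self._entries[pid]

    def search(self, **criteria: Any) -> List[str]:
        """Return identifiers of all entries matching every criterion."""
        return [pid for pid, md in self._entries.items()
                if all(md.get(k) == v for k, v in criteria.items())]


# Placeholder entries, for illustration only.
registry = MetadataRegistry()
pid_a = registry.register({"code": "code_A", "xc_functional": "PBE"})
pid_b = registry.register({"code": "code_B", "xc_functional": "PBE"})

pbe_entries = registry.search(xc_functional="PBE")
code_b_entries = registry.search(code="code_B")
```

In this picture, register corresponds to findability (an identifier plus searchable metadata), while get and search play the role of the API that provides accessibility to single entries and to sets of entries.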

Establishing one or more metadata schemas that are FAIR-compliant, and that therefore enable the materials science community to efficiently share the heterogeneously and decentrally produced data, needs to be a community effort. The workshop “Shared Metadata and Data Formats for Big-Data Driven Materials Science: A NOMAD–FAIR-DI Workshop” was organized and held in Berlin in July 2019 to ignite this effort. In the following sections, we describe the identified challenges and first-stage plans, divided into different aspects that are crucial to be addressed in computational materials science.

In the next section, we describe the identified challenges and first plans for FAIR metadata schemas for computational materials science. As an example, we also summarize there the main ideas behind the metadata schema implemented in the Novel-Materials Discovery (NOMAD) Laboratory for storing and managing millions of data objects produced by means of atomistic calculations (both ab initio and molecular mechanics), employing tens of different codes, which cover the overwhelming majority of what is actually used in terms of volume-of-data production in the community. We then follow with more detailed sections discussing the specific challenges related to interoperability and reusability for ground-state calculations (Section “Metadata for ground-state electronic-structure calculations”), perturbative and excited-state calculations (Section “Metadata for external-perturbation and excited-state electronic-structure calculations”), potential-energy sampling (molecular dynamics and more, Section “Metadata for potential-energy sampling”), and generalized workflows (Section “Metadata for computational workflows”). Challenges related to the choice of file formats are discussed in Section “File Formats.” Outlooks on metadata schema(s) for experimental materials science and on the introduction of formal ontologies for materials science databases constitute Sections “Metadata schemas for experimental materials science” and “Outlook on ontologies in materials science,” respectively.

Towards FAIR metadata schemas for computational materials science

The materials science community realized long ago that it is necessary to structure data by means of metadata schemas. In this section, we describe pioneering and recent examples of such schemas, and how a metadata schema becomes FAIR-compliant.

To our knowledge, the first systematic effort to build a metadata schema for exchanging data in chemistry and materials science is CIF, an acronym that originally stood for "Crystallographic Information File," the data exchange standard file format introduced in 1991 by Hall, Allen, and Brown. [5,6] Later, the CIF acronym was extended to also mean "Crystallographic Information Framework" [7], a broader system of exchange protocols based on data dictionaries and relational rules expressible in different machine-readable manifestations. These include the Crystallographic Information File itself, but also, for instance, XML (Extensible Markup Language), a general framework for encoding text documents in a format that is meant to be both human and machine readable. CIF was developed by the International Union of Crystallography (IUCr) working party on Crystallographic Information and was adopted in 1990 as a standard file structure for the archiving and distribution of crystallographic information. It is now well established and is in regular use for reporting crystal structure determinations to Acta Crystallographica and other journals. More recently, CIF has been adapted to different areas of science such as structural biology (mmCIF, the macromolecular CIF [8]) and spectroscopy. [9] The CIF framework includes a strict syntax definition in machine-readable form and a dictionary defining the (meta)data items. It has been noted that the adoption of the CIF framework in IUCr publications has allowed for a significant reduction in the number of errors in published crystal structures. [10,11]
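
To give a feeling for what a machine-readable tag-value format looks like in practice, the following Python sketch reads the simplest kind of CIF content: a data block header followed by single-line pairs of a data name and a value. It is a deliberately incomplete illustration, not a CIF parser: loop_ constructs, multi-line text fields, trailing comments, and validation against a dictionary are all ignored, and the function name parse_simple_cif is hypothetical. The standard-library shlex module is used here only as a convenient approximation of CIF quoting.

```python
import shlex
from typing import Dict


def parse_simple_cif(text: str) -> Dict[str, Dict[str, str]]:
    """Collect single-line '_name value' pairs per data block (sketch only)."""
    blocks: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        if line.lower().startswith("data_"):
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            parts = shlex.split(line)  # honors quoted values with spaces
            if len(parts) == 2:
                current[parts[0]] = parts[1]
    return blocks


# Illustrative fragment, not a complete CIF.
SAMPLE = """\
# illustrative fragment
data_Si
_chemical_formula_sum Si
_cell_length_a 5.431
_symmetry_space_group_name_H-M 'F d -3 m'
"""

blocks = parse_simple_cif(SAMPLE)
```

All values are kept as strings; deciding that _cell_length_a is a length with a unit is exactly the kind of information that the dictionary, rather than the file syntax, has to supply.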

An early example of an exhaustive metadata schema for chemistry and materials science is the Chemical Markup Language (CML) [12,13,14], whose first public version was released in 1995. CML is a dictionary for chemical metadata, encoded in XML. CML is accessible (for reading, writing, and validation) via the Java library JUMBO (Java Universal Molecular/Markup Browser for Objects). [14] The general idea of CML is to represent with a common language all kinds of documents that contain chemical data, even though the language, as of its latest update in 2012 [15], mainly covers the description of molecules (e.g., IUPAC name, atomic coordinates, bond distances) and of inputs/outputs of computational chemistry codes such as Gaussian03 [16] and NWChem. [17] Specifically, in the CML representation of computational chemistry calculations [18], (ideally) all the information on a simulation that is contained in the input and output files is mapped onto a format that is in principle independent of the code itself. Such information is:

Administrative data like the code version, libraries for the compilation, hardware, user submitting the job;
Materials-specific (or materials-snapshot-specific) data like computed structure (e.g., atomic species, coordinates), the physical method (e.g., electronic exchange-correlation treatment, relativistic treatment), numerical settings (basis set, integration grids, etc.);
Computed quantities (energies, forces, sequence of atomic positions in case a structure relaxation or some dynamical propagation of the system is performed, etc.).

The different types of information are hierarchically organized in modules, e.g., environment (for the code version, hardware, run date, etc.), initialization (for the exchange correlation treatment, spin, charge), molgeom (for the atomic coordinates and the localized basis set specification), and finalization (for the energies, forces, etc.). The most recent release of the CML schema contains more than 500 metadata-schema items, i.e., unique entries in the metadata schema. It is worth noticing that CIF is the dictionary of choice for the crystallography domain within CML.
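
The hierarchical organization into modules can be sketched with the Python standard library. The example below nests one XML element per module (environment, initialization, molgeom, finalization) under a common root. The element and attribute names (module, role, scalar, title) are simplified for illustration and are not validated against the CML schema; the function build_calculation and all values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def build_calculation(modules: Dict[str, Dict[str, Any]]) -> ET.Element:
    """Nest one <module> per category, each holding named <scalar> items."""
    root = ET.Element("module", {"role": "job"})
    for name, items in modules.items():
        module = ET.SubElement(root, "module", {"role": name})
        for key, value in items.items():
            scalar = ET.SubElement(module, "scalar", {"title": key})
            scalar.text = str(value)
    return root


# Placeholder values, for illustration only.
calculation = build_calculation({
    "environment": {"code_version": "x.y", "run_date": "2019-07-01"},
    "initialization": {"xc_functional": "PBE", "charge": 0},
    "molgeom": {"coordinates_angstrom": "0 0 0; 0 0 0.74"},
    "finalization": {"total_energy_eV": -1.0},
})

module_roles = [m.get("role") for m in calculation.findall("module")]
xml_text = ET.tostring(calculation, encoding="unicode")
```

Because the structure is independent of the code that produced the numbers, the same tree layout could in principle be filled from the output of different simulation packages, which is the point of a code-independent representation.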

References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several inline URLs from the original were turned into full citations for this version.