Multi-Spectra Data Formats

There are a few data formats that should be considered for holding multiple spectra, and we describe some options here. The list echoes the discussion in [Ravel et al. (2012)], but may not be comprehensive. Other formats could be considered; please propose alternatives. Some important criteria to consider for these formats are:

  1. How well do they map onto a single XDI file? Can 1000 XDI files be extracted from a container of 1000 XAS spectra? How much work would that be for a non-programming scientist?

  2. How efficient are these formats at storing data?

  3. How tied to one library or application are these files? How well supported and long-lived is that application and support expected to be?

  4. How likely is it that the data from these files will be easily extractable in 50 years?

Most of the formats discussed below are meant to be general-purpose containers of data, and so would require a “Schema” or set of tags for data and metadata to be defined. In order for criterion 1 to be met, this Schema should closely match that of XDI. It may be important to have tools to validate multi-spectra files, and potentially to guide corrections to ensure compliance.

Zip file of XDI files

As for the limitations given above, XDI does not prevent a plain text file from holding 100 data columns. A simple Zip file of 1000 XDI data files is a perfectly reasonable way to contain and transmit multiple data files. Measured against the criteria given above, this approach does very well.
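As a rough illustration of the first criterion, the sketch below packs 1000 small XDI-like text files into a Zip archive and extracts them again, using only the Python standard library. The file names, the helper functions, and the simplified header are assumptions made for this example; the header is only XDI-like and has not been checked against the XDI validation library.

```python
import io
import zipfile


def make_xdi_text(index):
    """Build a tiny XDI-like text file (illustrative only, not validated)."""
    lines = [
        "# XDI/1.0",
        "# Column.1: energy eV",
        "# Column.2: i0",
        "# Column.3: itrans",
        "# Element.symbol: Cu",
        "# Element.edge: K",
        "#---",
        "# energy i0 itrans",
    ]
    for step in range(3):
        energy = 8900.0 + 10.0 * step
        lines.append(f"{energy:.1f} {1000 + index} {500 + index}")
    return "\n".join(lines) + "\n"


def pack_spectra(n_spectra):
    """Write n_spectra text files into an in-memory Zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for i in range(n_spectra):
            # Hypothetical naming scheme: one archive member per spectrum.
            archive.writestr(f"spectrum_{i:04d}.xdi", make_xdi_text(i))
    return buffer.getvalue()


def unpack_spectra(zip_bytes):
    """Return a dict mapping member name to the text of each file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
        }


zip_bytes = pack_spectra(1000)
spectra = unpack_spectra(zip_bytes)
print(len(spectra))
```

A non-programming scientist would not need any of this: any standard unzip tool recovers the same 1000 files, each of which is still an ordinary text file.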

The main drawback I see for this approach is that it still requires conversion of “Raw” beamline data files into a format that follows XDI conventions very closely (with fully compliant tags and columns with names exactly matching “i0”, “itrans”, “ifluor”, etc). If we conclude that “Zip file of XDI files” is OK, a beamline or facility that collects plain text files that are XDI-like could simply conclude that a Zip file of as-collected text files is satisfactory.

While XDI is fairly simple and has a library for reading and validating its format, it is reasonable to admit that it is used only within the XAS community, and is not yet used uniformly within that community. Using this format (or probably any other format) would benefit from improved validation software, including for multiple spectra, and from some thought about support within the XAS community.

Sqlite

Sqlite was proposed in [Ravel et al. (2012)] as a potential format for multiple spectra. Indeed, I admit to being in favor of Sqlite for some use cases. Sqlite does require a single library and application to read the format, but it is ubiquitous and extremely well-supported and portable. Still, extracting data from an Sqlite file requires use and knowledge of SQL. SQL databases are general-purpose ways to store complex data structures, but are often not trivial to work with: the use of tables and relations can be somewhat confusing even for people trained in their use. In addition, Sqlite by itself is not really designed or well-suited to storing numerical arrays of data.
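To make these two points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and helper functions are hypothetical and are not taken from [Ravel et al. (2012)]. Because Sqlite has no native array type, the numerical columns are stored here as JSON-encoded text, which is one common workaround and not the only one.

```python
import json
import sqlite3


def create_db():
    """Create an in-memory database with one (hypothetical) spectrum table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE spectrum (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               element TEXT,
               edge TEXT,
               energy TEXT,   -- JSON-encoded list of floats
               i0 TEXT,       -- JSON-encoded list of floats
               itrans TEXT    -- JSON-encoded list of floats
           )"""
    )
    return conn


def add_spectrum(conn, name, element, edge, energy, i0, itrans):
    """Insert one spectrum, serializing each array to JSON text."""
    conn.execute(
        "INSERT INTO spectrum (name, element, edge, energy, i0, itrans) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (name, element, edge,
         json.dumps(energy), json.dumps(i0), json.dumps(itrans)),
    )


def load_spectrum(conn, name):
    """Fetch one spectrum by name; the caller must know the SQL schema."""
    row = conn.execute(
        "SELECT element, edge, energy, i0, itrans FROM spectrum WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    element, edge, energy, i0, itrans = row
    return {
        "element": element,
        "edge": edge,
        "energy": json.loads(energy),
        "i0": json.loads(i0),
        "itrans": json.loads(itrans),
    }


conn = create_db()
add_spectrum(conn, "cu_foil", "Cu", "K",
             [8900.0, 8910.0, 8920.0], [1000.0, 1001.0, 1002.0],
             [500.0, 480.0, 300.0])
loaded = load_spectrum(conn, "cu_foil")
print(loaded["energy"])
```

Even in this small case, reading the data back requires knowing the table name, the column names, and how the arrays were encoded, none of which the file format itself dictates.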

I would not recommend using Sqlite as a standard format for XAS data using the criteria listed above.

XML, JSON, etc

Like XDI, XML is a plain text format. It is highly and rigorously structured and capable of holding complex and nested data structures in a hierarchical form. Despite being plain text, it is not really designed to be human-readable. JSON and a host of other formats and markup languages (YAML, TOML, etc) are similar to XML, in that they can hold arbitrarily nested data structures in plain text files. These are generally somewhat easier to work with and sometimes more human-readable than XML. All of these formats are in widespread use, and have multiple libraries to support reading, writing, and validation. As with Sqlite, none of these have especially good support for arrays of numerical data.
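For illustration, the sketch below holds two spectra in a nested JSON structure and reads one column back, using Python's standard json module. The key names ("spectra", "element", "columns") are invented for this example and do not come from any agreed Schema. Note that the arrays are simply lists of numbers written out as text, with no dedicated numerical array type.

```python
import json

# Hypothetical container layout: a list of spectra, each with
# nested metadata and a dictionary of named data columns.
container = {
    "spectra": [
        {
            "name": "cu_foil",
            "element": {"symbol": "Cu", "edge": "K"},
            "columns": {
                "energy": [8900.0, 8910.0, 8920.0],
                "i0": [1000.0, 1001.0, 1002.0],
                "itrans": [500.0, 480.0, 300.0],
            },
        },
        {
            "name": "fe_foil",
            "element": {"symbol": "Fe", "edge": "K"},
            "columns": {
                "energy": [7100.0, 7110.0, 7120.0],
                "i0": [2000.0, 2001.0, 2002.0],
                "itrans": [900.0, 850.0, 600.0],
            },
        },
    ]
}


def get_column(data, spectrum_name, column_name):
    """Return one named column from one named spectrum."""
    for spectrum in data["spectra"]:
        if spectrum["name"] == spectrum_name:
            return spectrum["columns"][column_name]
    raise KeyError(spectrum_name)


# Round trip through plain text, as would happen when writing a file.
text = json.dumps(container, indent=2)
restored = json.loads(text)
fe_energy = get_column(restored, "fe_foil", "energy")
print(fe_energy)
```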

Though in common use for many purposes, I would say that these formats offer little advantage over the other formats discussed.

CIF

Like XML, JSON, etc, CIF is a plain text format capable – at least in principle – of holding complex and nested data structures in a hierarchical format. It is used within the scientific community to hold crystallographic information, which does indeed contain complex data structures. It is not used outside of the crystallography community, and libraries to work with CIF files have not had a great history of support.

CIF can support multiple data tables in a file, but with an unusual syntactical approach to hierarchical formatting that can be confusing and somewhat fragile. While CIF is used for crystal structure data, it is not widely used (as far as I know, it is not used at all) for storing or transmitting arrays of primary experimental data. I believe there is not a standard “schema” for experimental data (indeed, in my experience, Schema for CIF are not always clearly established or followed for crystallographic data).

Although CIF was proposed in [Ravel et al. (2012)] as a potential format for multiple spectra, I think it has few advantages over other approaches.

HDF5 and NeXuS

HDF5 was proposed in [Ravel et al. (2012)] as a potential format for multiple spectra. This format explicitly supports multi-dimensional arrays of numerical data such as those collected at synchrotrons, and is in wide use at synchrotron facilities for collecting large datasets for imaging, scattering, and spectroscopy experiments. Notably, it has very good support for compression of numerical data.

HDF5 uses a hierarchical format much like a computer file system, with Folders called Groups that contain Datasets and other Groups. Though perhaps not elegant or formally complete, this gives a familiar and easy-to-navigate approach to accessing multiple spectra and arrays of data. Metadata is supported for all Datasets and Groups. Compared to XML, JSON, etc, storing complex data structures that are not numerical arrays can be a bit cumbersome (perhaps using a JSON-encoded string of a keyword/value dictionary), but it is not too bad and should not be considered a stumbling block. As it turns out, this mixed approach was not needed when mapping XDI data as described below.
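The Group/Dataset/metadata layout can be sketched with the h5py library and NumPy, both of which are assumed to be installed. The group names, dataset names, and attribute keys below are hypothetical; they are not the NeXuS schema and not the XDI mapping discussed in this document, only an illustration of one Group per spectrum with compressed numerical Datasets and attached metadata.

```python
import h5py
import numpy as np


def write_spectra(h5file, spectra):
    """Store each spectrum as one Group with Datasets and attributes."""
    for name, spec in spectra.items():
        group = h5file.create_group(name)
        # Metadata is attached to the Group as HDF5 attributes.
        for key, value in spec["metadata"].items():
            group.attrs[key] = value
        # Each data column becomes a gzip-compressed numerical Dataset.
        for column, values in spec["columns"].items():
            group.create_dataset(
                column,
                data=np.asarray(values, dtype="float64"),
                compression="gzip",
            )


# Hypothetical example data for two spectra.
spectra = {
    "cu_foil": {
        "metadata": {"element": "Cu", "edge": "K"},
        "columns": {
            "energy": [8900.0, 8910.0, 8920.0],
            "i0": [1000.0, 1001.0, 1002.0],
            "itrans": [500.0, 480.0, 300.0],
        },
    },
    "fe_foil": {
        "metadata": {"element": "Fe", "edge": "K"},
        "columns": {
            "energy": [7100.0, 7110.0, 7120.0],
            "i0": [2000.0, 2001.0, 2002.0],
            "itrans": [900.0, 850.0, 600.0],
        },
    },
}

# The "core" driver with backing_store=False keeps the file in memory,
# so this sketch does not write anything to disk.
with h5py.File("in_memory.h5", "w", driver="core", backing_store=False) as h5file:
    write_spectra(h5file, spectra)
    names = sorted(h5file.keys())
    # Paths work like file-system paths: Group name, then Dataset name.
    energy = h5file["cu_foil/energy"][:]
    edge = h5file["cu_foil"].attrs["edge"]

print(names, energy, edge)
```

Navigating by path, as in "cu_foil/energy", is what makes the file-system analogy useful in practice.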

HDF5 is targeted at scientific data. It is developed and supported by one non-profit company that gets both federal and industrial funding. Libraries and support are available for many computer languages used for dealing with data, but perhaps not all are “well-supported”. I will say that HDF5 files are not human-readable, and cannot be read without the HDF5 support libraries. In addition, it is possible to corrupt HDF5 files in a way that is completely unrecoverable. For archiving and communication purposes this is not a huge concern, as those use-cases imply that a backup or “standard source” will be available. However, it may be a concern when creating and using HDF5 files.

As with Sqlite, XML, and JSON, HDF5 is meant to be general purpose and does not have a Schema. The NeXuS project is designed to add domain-specific Schema to HDF5 (and other formats, but mostly HDF5 these days) for scientific data from neutron, X-ray, and similar user facilities. NeXuS is meant to define Schema written and supported by the scientific community at these user facilities, with the same basic aim as described here: to provide standardized formats (HDF5 hierarchy + clearly defined Schema created by the relevant community) to better share and communicate complex scientific data. There are some support libraries and applications for validating NeXuS files.

Some synchrotron facilities are beginning to move toward expecting (if not requiring) that NeXuS be an available format for data.

With the criteria given above, I think it is reasonable to explore HDF5+NeXuS as a format for sharing multiple XAS spectra.