.. include:: _defs.rst

.. _multispec_sect:

Multi-Spectra Data Formats
-----------------------------------------------------------------------------

There are a few data formats that can be considered for holding multiple
spectra.  We will describe a few options here.  The list echoes the discussion in
:cite:`XASDataFormats`, but may not be comprehensive.  If other alternatives
are proposed, they should be considered.  Some important criteria to consider
for these formats are:

   1. How well do they map onto a single XDI file?  Can 1000 XDI files be
      extracted from a container of 1000 XAS spectra? How much work would that
      be for a non-programming scientist?
   2. How efficient are these formats at storing data?
   3. How tied to one library or application are these files? How well
      supported and long-lived is that application and support expected to be?
   4. How likely is it that the data from these files will be easily
      extractable in 50 years?

Most of the formats discussed below are meant to be general-purpose containers
of data, and so would require a "Schema" or set of tags for data and metadata
to be defined.  In order for criterion 1 to be met, this should closely match
that of XDI.  It may be important to have tools to validate multi-spectra
files, and potentially guide corrections to ensure compliance.


Zip file of XDI files
~~~~~~~~~~~~~~~~~~~~~~~~

Despite the limitations given above, XDI does not *prevent* 100 data columns in a
plain text file.  A simple Zip file of 1000 XDI data files is a perfectly
reasonable way to contain and transmit multiple data files.  For the criteria
given above, this approach has very favorable answers.
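
As a minimal sketch of how little tooling this approach needs, the following
uses only the Python standard library to bundle text files into a Zip archive
and read them back.  The file names and file contents are hypothetical
placeholders, not valid XDI; real files would come from a beamline or an XDI
library.

.. code-block:: python

   import io
   import zipfile


   def pack_spectra(files, archive):
       """Write a mapping of {file name: text} into a Zip archive."""
       with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
           for name, text in files.items():
               zf.writestr(name, text)


   def unpack_spectra(archive):
       """Return {file name: text} for every member of a Zip archive."""
       with zipfile.ZipFile(archive, "r") as zf:
           return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


   # Hypothetical placeholder contents standing in for real XDI files.
   spectra = {
       f"scan_{i:04d}.xdi": f"# placeholder header for scan {i}\n8000.0 1.0 0.5\n"
       for i in range(3)
   }

   buffer = io.BytesIO()          # an in-memory stand-in for a file on disk
   pack_spectra(spectra, buffer)
   buffer.seek(0)
   restored = unpack_spectra(buffer)

Each member of the archive remains an ordinary text file, so a scientist
without programming experience could also extract the files with any desktop
Zip utility.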

The main drawback I see for this approach is that it still requires conversion
of "Raw" beamline data files into a format that follows XDI conventions very
closely (with fully compliant tags and columns with names exactly matching
"i0", "itrans", "ifluor", etc).  If we conclude that "Zip file of XDI files" is
OK, a beamline or facility that collects plain text files that are XDI-like
could simply conclude that a Zip file of as-collected text files is
satisfactory.

While XDI is fairly simple and has a library for reading and validating its
format, it is reasonable to admit that it is used only within the XAS
community, and is not yet used uniformly within that community. Using this
format (or probably any other format) would benefit from improved validation
software, including for multiple spectra, and some thought about support within
the XAS community.



Sqlite
~~~~~~~~~~~~~~~~~~~~~~~~

Sqlite was proposed in :cite:`XASDataFormats` as a potential format for multiple
spectra.  Indeed, I admit to being in favor of Sqlite for some use cases.
Sqlite does require a single library and application to read the format, but it
is ubiquitous and extremely well-supported and portable. Still, extracting data
from an Sqlite file requires use and knowledge of SQL.  SQL databases are
general-purpose ways to store complex data structures, but are often not
trivial to work with: the use of tables and relations can be somewhat confusing
even for people trained in their use.  In addition, Sqlite by itself is not
really designed or well-suited to storing numerical arrays of data.
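
To illustrate the last point, here is a minimal sketch using Python's standard
``sqlite3`` module.  Sqlite has no native array column type, so the writer has
to serialize each array (here as JSON text) and the reader has to know how to
decode it.  The table layout and column names are hypothetical and are not a
proposed schema.

.. code-block:: python

   import json
   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE spectrum ("
       "id INTEGER PRIMARY KEY, name TEXT, energy TEXT, mu TEXT)"
   )

   energy = [8000.0, 8001.0, 8002.0]
   mu = [0.10, 0.55, 0.90]

   # Arrays are stored as JSON-encoded text: an application-level convention,
   # not something Sqlite itself understands.
   conn.execute(
       "INSERT INTO spectrum (name, energy, mu) VALUES (?, ?, ?)",
       ("cu_foil", json.dumps(energy), json.dumps(mu)),
   )
   conn.commit()

   # Reading the data back requires SQL plus knowledge of the encoding used.
   row = conn.execute(
       "SELECT energy, mu FROM spectrum WHERE name = ?", ("cu_foil",)
   ).fetchone()
   energy_out = json.loads(row[0])
   mu_out = json.loads(row[1])
   conn.close()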

I would not recommend using Sqlite as a standard format for XAS data using the
criteria listed above.


XML, JSON, etc
~~~~~~~~~~~~~~~~~~~~~~~~~

Like XDI, XML is a plain text format.  It is highly and rigorously structured
and capable of holding complex and nested data structures in a hierarchical
form. Despite being plain text, it is not really designed to be human-readable.
JSON and a host of other formats and markup languages (YAML, TOML, etc) are
similar to XML, in that they can hold arbitrarily nested data structures in
plain text files.  These are generally somewhat easier to work with and
sometimes more human-readable than XML.  All of these formats are in
widespread use, and have multiple libraries to support reading, writing, and
validation.  As with Sqlite, none of these have especially good support for
arrays of numerical data.
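
As a rough sketch of what such a container might look like, the following
uses Python's standard ``json`` module to hold one spectrum with its metadata
and data columns.  The nesting and key names are hypothetical, chosen only to
resemble XDI metadata and column names; every number is written out as decimal
text, which is part of why these formats are not efficient for large arrays.

.. code-block:: python

   import json

   # Hypothetical layout: a list of spectra, each with metadata and columns.
   container = {
       "spectra": [
           {
               "name": "scan_0001",
               "metadata": {"Element.symbol": "Cu", "Element.edge": "K"},
               "columns": {
                   "energy": [8000.0, 8001.0, 8002.0],
                   "i0": [1000.0, 1001.0, 999.0],
                   "itrans": [900.0, 600.0, 400.0],
               },
           }
       ]
   }

   text = json.dumps(container, indent=2)   # serialize to plain text
   loaded = json.loads(text)                # parse it back
   first_energy = loaded["spectra"][0]["columns"]["energy"][0]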

Though in common use for many purposes, I would say that these formats offer
little advantage over the other formats discussed.



CIF
~~~~~~~~~~~~~~~~~~~

Like XML, JSON, etc, CIF is a plain text format capable -- at least in principle
-- of holding complex and nested data structures in a hierarchical format.  It
is used within the scientific community to hold crystallographic information,
which does indeed contain complex data structures.  It is not used outside of
the crystallography community, and libraries to work with CIF files have not
had a great history of support.

CIF can support multiple data tables in a file, but with an unusual syntactical
approach to hierarchical formatting that can be confusing and somewhat fragile.
While CIF is used for crystal structure data, it is not widely used (as far as
I know, it is not used at all) for storing or transmitting arrays of primary
experimental data.  I believe there is not a standard "schema" for experimental
data (indeed, in my experience, Schema for CIF are not always clearly
established or followed for crystallographic data).

Although CIF was proposed in :cite:`XASDataFormats` as a potential format for
multiple spectra, I think it has few advantages over other approaches.


HDF5 and NeXuS
~~~~~~~~~~~~~~~~~~~~~~~~

HDF5 was proposed in :cite:`XASDataFormats` as a potential format for multiple
spectra.  This format explicitly supports multi-dimensional arrays of
numerical data such as those collected at synchrotrons, and is in wide use at
synchrotron facilities for collecting large datasets for imaging, scattering,
and spectroscopy experiments.  Notably, it has very good support for
compression of numerical data.

HDF5 uses a hierarchical format much like a computer file system, with Folders
called Groups that contain Datasets and other Groups. Though perhaps not
elegant or formally complete, this gives a familiar and easy to navigate
approach to accessing multiple spectra and arrays of data. Metadata is
supported for all Datasets and Groups.  Compared to XML, JSON, etc, storing
complex data structures that are not numerical arrays can be a bit cumbersome
(perhaps using a JSON-encoded string of a keyword/value dictionary) but is not
too bad, and should not be considered a stumbling block.
As it turns out, this mixed approach was not needed when mapping XDI data as
described below.
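
The Group/Dataset hierarchy described above can be sketched with the ``h5py``
package (a third-party Python binding to HDF5) and NumPy.  The group names,
dataset names, attribute keys, and synthetic data are all hypothetical and do
not follow any NeXuS definition; gzip is used here as one example of the
compression filters HDF5 provides.

.. code-block:: python

   import os
   import tempfile

   import h5py
   import numpy as np

   # Synthetic data standing in for a measured spectrum.
   energy = np.linspace(8000.0, 8100.0, 101)
   mu = np.exp(-((energy - 8050.0) / 20.0) ** 2)

   path = os.path.join(tempfile.mkdtemp(), "spectra.h5")

   with h5py.File(path, "w") as f:
       for i in range(3):
           grp = f.create_group(f"scan_{i:04d}")   # one Group per spectrum
           grp.attrs["element"] = "Cu"             # metadata as attributes
           grp.attrs["edge"] = "K"
           grp.create_dataset("energy", data=energy, compression="gzip")
           grp.create_dataset("mu", data=mu * (i + 1), compression="gzip")

   with h5py.File(path, "r") as f:
       names = sorted(f.keys())                    # list the spectra
       element = f["scan_0001"].attrs["element"]   # read one attribute
       mu_back = f["scan_0002/mu"][...]            # path-like access to data

Access by path (``"scan_0002/mu"``) is what makes the file-system analogy
practical: a reader can list the Groups and pull out one spectrum without
loading the rest of the file.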

HDF5 is targeted for scientific data.  It is developed and supported by one
non-profit company that gets both federal and industrial funding. Libraries and
support are available for many computer languages used for dealing with data,
but perhaps not all are "well-supported".  I will say that HDF5 files are not
human-readable, and cannot be read without the HDF5 support libraries.  In
addition, it is possible to corrupt HDF5 files in a way that is completely
unrecoverable.  For archiving and communication purposes this is not a huge
concern, as those use-cases imply that a backup or "standard source" will be
available.  But, this may be a concern when creating and using HDF5 files.


As with Sqlite, XML, and JSON, HDF5 is meant to be general purpose and does not
have a Schema.  The `NeXuS`_ project is designed to add domain-specific Schema
to HDF5 (and other formats, but mostly HDF5 these days) for scientific data
from neutron, X-ray, and similar user facilities.  NeXuS is meant to define
schema written and supported by the scientific community at these user
facilities with the same basic aim as the goals described here: to provide
standardized formats (HDF5 hierarchy + clearly defined schema created by the
relevant community) to better share and communicate complex scientific data.
There are some support libraries and applications for validating NeXuS files.

Some synchrotron facilities are beginning to move toward expecting (if not
requiring) that NeXuS be an available format for data.

With the criteria given above, I think it is reasonable to explore HDF5+NeXuS
as a format for sharing multiple XAS spectra.