See also WorkingDataFormats and BinX.
Just found this overall description of scientific data formats.
Why Bother?
Moore's Law implies that we don't need to worry about XML's verbose or processor-hungry representation; but the VO's datasets are large and seem to be growing faster than CPU speeds or network bandwidths. The W3C group considered the issue but is not likely to produce a solution soon, or at all.
So what do we want?
A small Java package that implements input and output streams. The output stream takes tags with attributes and values and writes them out in binary form. An input stream implementation reads the resulting binary form and emits it as ASCII, so the file can be accessed exactly like an ASCII file, fed through a SAX parser, etc.
We can then generate a file with considerably less processing power, occupying a small fraction of the disk space. We can pass it around our components as if it were any other XML file. We only need to convert when the user asks for it - and even then the delegate can expand it.
We need to handle any XML file, with or without schema.
Isn't it hard?
I don't think so:
Noddy Solution
Off the top of my head, we implement an OutputStream that has writeDOM, writeTag, writeValue(String), writeValue(double) etc. methods. It writes to a DataOutputStream, using its already-optimised binary write methods. The first time a name (eg an XML tag name or attribute name) is encountered, it is indexed internally and written out as a null-terminated string followed by the index number. The next time that name is encountered, only the index number is written. Binary flags would mark first-time strings, subsequent index numbers, end of tags, attribute names, value types, etc. Numbers can be written out as numbers.
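A minimal sketch of the writer side, to make the indexing idea concrete. Everything named here is an assumption for illustration: the class name BinaryXmlWriter, the flag values, the 4-byte index and the endTag method are not from any existing library. Attributes and writeDOM are left out.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class name, flag values and layout are assumptions.
class BinaryXmlWriter {
    // Flag bytes saying what follows in the stream (arbitrary values).
    static final int NEW_NAME = 1; // first use of a name: string, 0 byte, index
    static final int NAME_REF = 2; // repeat use of a name: index only
    static final int END_TAG = 3;  // closes the most recently opened tag
    static final int TEXT = 4;     // null-terminated text value
    static final int DOUBLE = 5;   // 8-byte binary double

    private final DataOutputStream out;
    private final Map<String, Integer> nameIndex = new HashMap<>();

    BinaryXmlWriter(OutputStream target) {
        this.out = new DataOutputStream(target);
    }

    // Opens a tag; the name is spelled out only the first time it is seen.
    void writeTag(String name) throws IOException {
        Integer index = nameIndex.get(name);
        if (index == null) {
            index = nameIndex.size();
            nameIndex.put(name, index);
            out.writeByte(NEW_NAME);
            writeCString(name);
            out.writeInt(index);
        } else {
            out.writeByte(NAME_REF);
            out.writeInt(index);
        }
    }

    void endTag() throws IOException {
        out.writeByte(END_TAG);
    }

    void writeValue(String text) throws IOException {
        out.writeByte(TEXT);
        writeCString(text);
    }

    // Numbers go out in binary, with no conversion to or from a String.
    void writeValue(double number) throws IOException {
        out.writeByte(DOUBLE);
        out.writeDouble(number);
    }

    // UTF-8 bytes followed by a 0 byte; XML names and text cannot contain NUL.
    private void writeCString(String s) throws IOException {
        out.write(s.getBytes(StandardCharsets.UTF_8));
        out.writeByte(0);
    }
}
```

With this layout a first `<TD>` holding a double costs 18 bytes and every later one costs 15, whatever the length of the tag name.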
Hmmmm; writing numbers as numbers implies that a string that merely 'looks' like a number will be stored as a number. This only matters for real numbers; how 'real' is the conversion back? It doesn't apply to numbers like telephone numbers, as they'll get restored as integers, although any leading zeros would be lost (and if they have spaces they will get stored as text).
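To see how 'real' the conversion back is, here is a small sketch using only standard Double and Long parsing. The class and method names are made up for illustration, and the safeAsDouble guard is one possible answer to the question, not something already agreed.

```java
// Hypothetical helper for checking whether text survives a trip through binary.
class NumberRoundTrip {
    // The text we would get back after storing the value as a binary double.
    static String viaDouble(String text) {
        return Double.toString(Double.parseDouble(text));
    }

    // The text we would get back after storing the value as a binary long.
    static String viaLong(String text) {
        return Long.toString(Long.parseLong(text));
    }

    // Assumed guard: only store as a double when the text comes back unchanged.
    static boolean safeAsDouble(String text) {
        try {
            return viaDouble(text).equals(text);
        } catch (NumberFormatException notANumber) {
            return false; // eg a phone number with spaces stays as text
        }
    }
}
```

For example, viaDouble("1.10") gives "1.1" and viaDouble("1e3") gives "1000.0": the value is the same but the text is not. Similarly viaLong("0131") gives "131". A writer that falls back to text whenever the guard fails would stay exactly reversible, at the cost of one parse and one format per candidate value.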
An input stream can reverse this to ASCII. Another might offer the same but in binary form - eg getTag, getInt, getReal, etc.
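A sketch of the reversing side, under the same kind of assumptions as any writer would need: an invented layout where flag byte 1 means a new name (null-terminated UTF-8, then a 4-byte index), 2 a name by index, 3 end of tag, 4 null-terminated text and 5 an 8-byte double. The class name BinaryXmlReader is hypothetical; attributes and escaping of characters like '<' in text are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: turns the assumed binary layout back into ASCII XML.
class BinaryXmlReader {
    static final int NEW_NAME = 1, NAME_REF = 2, END_TAG = 3, TEXT = 4, DOUBLE = 5;

    static String toXml(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        List<String> names = new ArrayList<>();  // index -> name, rebuilt as we read
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first
        StringBuilder xml = new StringBuilder();
        int flag;
        while ((flag = in.read()) != -1) {
            switch (flag) {
                case NEW_NAME: {
                    String name = readCString(in);
                    int index = in.readInt();
                    if (index != names.size()) {
                        throw new IOException("Unexpected name index " + index);
                    }
                    names.add(name);
                    open.push(name);
                    xml.append('<').append(name).append('>');
                    break;
                }
                case NAME_REF: {
                    String name = names.get(in.readInt());
                    open.push(name);
                    xml.append('<').append(name).append('>');
                    break;
                }
                case END_TAG:
                    xml.append("</").append(open.pop()).append('>');
                    break;
                case TEXT:
                    xml.append(readCString(in));
                    break;
                case DOUBLE:
                    xml.append(in.readDouble());
                    break;
                default:
                    throw new IOException("Unknown flag " + flag);
            }
        }
        return xml.toString();
    }

    // Reads UTF-8 bytes up to, but not including, the terminating 0 byte.
    private static String readCString(DataInputStream in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int b;
        while ((b = in.readUnsignedByte()) != 0) {
            bytes.write(b);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The name table is rebuilt from the stream itself, so the reader needs no schema. A binary-form reader (getTag, getReal, etc) would use the same loop but hand back the values instead of appending them to a string.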
We can even validate as we go. So converting from a datarowset to a votable, to be piped to a data warehouse to be loaded into another table can be done entirely in binary, but using the VOTable structure. So if at any point a tool in the pipeline cannot cope with the binary form, it is reversed into VOTable XML and on it goes.
As far as I can tell, this process is entirely reversible, needs no schema and can cope with any XML file. There is no processing to/from Strings. The result is not the most efficient but that's not what we're after.
We could write a FilterOutputStream that would interpret 'incoming' ASCII XML and do the same with that, but I'm not sure about this - if you've already spent the processing power creating the ASCII then you might as well just ZIP it.
FAST Web Services
The Java community is looking at this problem; the brand name is 'Fast':
http://java.sun.com/developer/technicalArticles/WebServices/fastWS/
WAP XML
There is already a 'compressed' form of XML used by WAP services:
http://www.w3.org/1999/06/NOTE-wbxml-19990624/
Have been looking at it:
http://www.w3.org/2003/07/binary-xml-cfp.html
ZIP
Just zip the stream up as it is being created, eg using
http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
This reduces network bandwidth but increases processing, as each 'original' binary number is converted to a string and then packed to binary. At the other end the string is unpacked, then parsed back to binary. If CPU speeds increase faster than network speeds - ie the bottleneck is still the network even after this quadruple processing has been done - then this might be OK.
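A rough sketch of zipping on the fly. It uses GZIPOutputStream and GZIPInputStream from java.util.zip rather than the ZipInputStream linked above, since a single stream needs no archive entries; that swap, and the ZippedXml class name, are my assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compresses ASCII XML as it is written, expands it as it is read.
class ZippedXml {
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the writer finishes the GZIP stream and flushes the trailer.
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(buffer), StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
        return buffer.toByteArray();
    }

    static String expand(byte[] compressed) throws IOException {
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8)) {
            StringBuilder xml = new StringBuilder();
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                xml.append(chunk, 0, n);
            }
            return xml.toString();
        }
    }
}
```

In a real pipeline the GZIP streams would wrap the socket or file streams directly; the byte arrays here just keep the example self-contained. Note that the numbers inside the XML are still strings at both ends, which is the extra processing described above.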
-- DaveMorris - 17 Mar 2005
-- MartinHill - 01 Sep 2005
I don't think BinX can do this without some kind of schema; BinX is about representing binary in XML rather than the other way around. So we would still have to convert from binary to string and back to local binary again. Under investigation, see the BinX page.
DFDL
Data Format Description Language, following on from BinX. See here. Another over-generalised solution (I think).
Common Data Format
Used by solar people for years, apparently:
http://nssdc.gsfc.nasa.gov/cdf/cdf_home.html
http://spdf.gsfc.nasa.gov/istp_guide/istp_guide.html
A hierarchical data format (?) with an XML form. Not sure, though, whether it's so general that you have to layer info over the top to describe where to find passbands etc.
HDF - Hierarchical Data Format
Binary data format from NCSA. Used in many science disciplines. No pure Java API. Essentially a structure (like FITS and 'raw' XML); we would have to define requirements (eg the keywords required to define WCS for images), not sure how (viz schemas).
HDS - Hierarchical Data System
I think some ATC people use this.
-- MartinHill - 08 Mar 2004