The most basic way of writing a table in VOTable format involves wrapping every value in cell tags and every row in row tags. This is very desireable for manipulation (e.g. it's easily parsed by XSLT), but inflates the size of the table relative to formats like FITS binary tables. VOTable allows embedded binary files to get round this problem, but the work-around makes the parsing harder.

Jim Grey of Microsoft suggests that although XML is bulky it compresses exceptionally well, since the information density is very low in XML's natural state. This suggests that a suitably-compressed file in VOTable format could be as small as a file of FITS binary tables for the same data-set. At TonyLinde's request, I have tested this.

I started with the four FITS files representing the object catalogue for one observation with the INT Wide-Field Camera: each file has one FITS table holding the objects found on one of the camera's four CCDs. I chose run 220 000 at random, for which the four tables contain a total of 13 505 rows. The FITS files have these sizes in bytes:

Chip File-size
1 443 520
2 512 640
3 466 560
4 495 360
Total 1 918 080

I then collated the four FITS files into one file using the Unix cat command and compressed the resulting file using gzip: after compression, it occupied 966 013 bytes, which is a compression ratio of 2.0.

I now extracted the data-rows of the four FITS tables into TSV format and then added row and cell tags to bring these data into the format they would have in a "longhand" VOTable file. I did not attempt to add the VOTable header. The size of the header is assumed to be negligable compared to the body of the table.

The uncompressed XML occupied 8 233 497 bytes, which is 4.3 times the size of the FITS files. When compressed with gzip, the XML shrank to 1 491 206 bytes, which is a compression ratio of 5.5.

Finally, the compressed XML is 1.54 times the size of the compressed, binary FITS-files.

-- GuyRixon - 18 Jun 2002

Further to this useful test of Guy's, it may be worth considering an alternative approach - called Liefke-Suciu decomposition - in which the original XML document is split into a series of files. The original reference for this is:

Liefke, H. & Suciu, D, 2000, Proc. ACM SIGMOD 
International conf.on Management of Data, pp. 153-164
"XMill: an efficient compressor for XML data"

and the WWW site http://www.research.att.com/sw/tools/xmill/ presents code, documentation, etc.

Peter Buneman of Edinburgh's Division of Informatics brought this approach to my attention. He ran some toy model tests of his own, in which he took a 672kB XML document of baseball statistics and reduced it to 1.3kB, and took a 7.6MB XML file concatenating Shakespeare plays and reduced it to 31kB.

Basically, the idea is that you decompose the original XML document into several files which encode its structure without using all the repeated text in the tags, thereby cutting down the data volume greatly. You get a "tag map" file which lists the tags and assigns a number to each, a "skeleton" which lists the order in which the tags arise in the document, and a collection of data files which list the text that go into each tag, and whose names encode the path through the tree of tags to get to them. All these files can then be gzipped themselves to reduce the data volume further.

It does look useful for large tabular datasets. What Peter and I have been discussing is its possible use in data-mining. You might extract the attributes of the objects of interest from a set of databases in the VO, then send them in VOTable format to some machine to data-mine, and that it might be better to store the data in compressed, vectorised XML on the data-mining machine, rather than in an RDBMS. One particular reason why this might be useful is that data-miners are going to want to ask things like "plot me X-ray luminosity versus radio power for objects more than 3 sigma redder than the median in V-K" or some such, and statistical operations like means, medians, etc, are much faster to do if the data are stored in column order (i.e. "vectorised") than in row order, as in an RDBMS.

BobMann - 20 Jun 2002

I repeated the test using xmill (pre-compiled Linux binary on Red Hat 7.3) instead of gzip. The compressed XML cames out at 1 362 437 bytes when using the default options on xmill. That's no significant gain over gzip. By setting "the compression factor of zlib" (whatever that is) up to 9 from its default of 6, as allowed in the xmill interface, the size went down to 1 343 842 bytes. This is still 90% of the gzip'd size and a compression factor of 6.1 with respect to the original XML.

I leave it to the reader to decide whether a 10% improvement in size is worth using special software.

Why does xmill offer so little improvement over gzip? Xmill works by rearranging the tags in the XML, storing similar tags together such that the actual compressor -- which is gzip in xmill's default configuration -- has a better chance of seeing and using patterns in the data. In a big VOTable, the tags are already sorted: the whole thing is a sea of cell tags with a sparse garnish of row tags. Xmill can move the row tags to optimize compression, but this doesn't achieve very much.

Where xmill might do well is in compressing a large collection of VOTable headers, saved without DATA sections. Here, the greater depth of structure and the diversity of content could defeat a simple compressor and xmill could mend things.

-- GuyRixon - 21 Jun 2002

Topic revision: r4 - 2002-06-21 - 18:03:43 - GuyRixon
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback