| Chip | File-size |
|---|---|
| 1 | 443 520 |
| 2 | 512 640 |
| 3 | 466 560 |
| 4 | 495 360 |
| Total | 1 918 080 |
Liefke, H. & Suciu, D, 2000, Proc. ACM SIGMOD International conf.on Management of Data, pp. 153-164 "XMill: an efficient compressor for XML data"and the WWW site http://www.research.att.com/sw/tools/xmill/ presents code, documentation, etc. Peter Buneman of Edinburgh's Division of Informatics brought this approach to my attention. He ran some toy model tests of his own, in which he took a 672kB XML document of baseball statistics and reduced it to 1.3kB, and took a 7.6MB XML file concatenating Shakespeare plays and reduced it to 31kB. Basically, the idea is that you decompose the original XML document into several files which encode its structure without using all the repeated text in the tags, thereby cutting down the data volume greatly. You get a "tag map" file which lists the tags and assigns a number to each, a "skeleton" which lists the order in which the tags arise in the document, and a collection of data files which list the text that go into each tag, and whose names encode the path through the tree of tags to get to them. All these files can then be gzipped themselves to reduce the data volume further. It does look useful for large tabular datasets. What Peter and I have been discussing is its possible use in data-mining. You might extract the attributes of the objects of interest from a set of databases in the VO, then send them in VOTable format to some machine to data-mine, and that it might be better to store the data in compressed, vectorised XML on the data-mining machine, rather than in an RDBMS. One particular reason why this might be useful is that data-miners are going to want to ask things like "plot me X-ray luminosity versus radio power for objects more than 3 sigma redder than the median in V-K" or some such, and statistical operations like means, medians, etc, are much faster to do if the data are stored in column order (i.e. "vectorised") than in row order, as in an RDBMS. BobMann - 20 Jun 2002 I repeated the test using xmill (pre-compiled Linux binary on Red Hat 7.3) instead of gzip. The compressed XML cames out at 1 362 437 bytes when using the default options on xmill. That's no significant gain over gzip. By setting "the compression factor of zlib" (whatever that is) up to 9 from its default of 6, as allowed in the xmill interface, the size went down to 1 343 842 bytes. This is still 90% of the gzip'd size and a compression factor of 6.1 with respect to the original XML. I leave it to the reader to decide whether a 10% improvement in size is worth using special software. Why does xmill offer so little improvement over gzip? Xmill works by rearranging the tags in the XML, storing similar tags together such that the actual compressor -- which is gzip in xmill's default configuration -- has a better chance of seeing and using patterns in the data. In a big VOTable, the tags are already sorted: the whole thing is a sea of cell tags with a sparse garnish of row tags. Xmill can move the row tags to optimize compression, but this doesn't achieve very much. Where xmill might do well is in compressing a large collection of VOTable headers, saved without DATA sections. Here, the greater depth of structure and the diversity of content could defeat a simple compressor and xmill could mend things. -- GuyRixon - 21 Jun 2002
![]() |
Click here for the AstroGrid Service Web |
This is the AstroGrid Development Wiki |
|