Background

The astrogrid project will (hopefully) consist of a wide variety of web/grid services, serving up data to be combined and processed, and ultimately displayed to the user.

It is beginning to appear that this data will be served in VOTable format; however I get the impression this is the case only 'by default' - because it is the only standard format we have to hand, rather than because it is suitable.

VOTable is not enough; it is very verbose, processor intensive and cannot be randomly accessed. While Moore's Law might help with the first two, our data sets are likely to grow faster than either processor speed or network bandwidth.

So this document revisits the VOTable format, the alternative formats, and what we are trying to do with our data exchanges, and proposes some solutions to the issues raised.

Assumptions & Limitations

  • concentrating here on solutions for Astrogrid1, with only a nod to the long term future ("Hello future")

Requirements

What do we need in our data formats?

  • Nice, easy-to-use formats for display, processing, applying tools to, etc

  • Fast, compact, random access data formats, mostly for processing & comparing large data sets.

I see these as two very different requirements - for two different formats.

  • Intermediate processing will involve more than just the data format - eg, how are we going to handle distributed joins?

VOTable

VOTable is XML, so let's have a look at that first:

| XML | Pros | Cons | Comments |
| ASCII Based | Human readable. Great for debugging, writing new utilities, future-proofing, etc. | Big. Very Big. So enormously big that you'd need a very large virtual stick to shake at it. | We know this - we just seem to forget sometimes. One of our 'requirements' is to reduce bandwidth. |
| ASCII Based | Wide Platform. All languages can handle strings, and most now have XML processing libraries. | Slow. Strings require a lot of processing to get to numbers and back, and a lot of our work is number based. | We've got big databases we want to do difficult, slow things with - such as joins. |
| Loose structure | Flexible - can freely describe many different data types and relate them to each other, in the one file. | Not random access. How do we view 'windows' on the data? How do we change single elements in huge data sets? | Different purposes |
| Has Schemas | Can move data validation to the programs building the data, rather than the programs using it. |  | All the schemas do is make up for XML being a 'metadata format' not a 'data format'. With schemas you can make a data format. |
| Been around a bit | Tools exist. HowTo docs exist. Skills exist. Idiots Guides exist. | Lots and lots and lots of stuff to read... | Generally a big plus |
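
To put a rough number on the 'Big' and 'not random access' cells, here is a minimal sketch (plain JDK, nothing AstroGrid-specific) that encodes the same column of doubles once as VOTable-style text cells and once as raw 8-byte values. The class and method names are made up for illustration, and the actual ratio depends on how many digits your values carry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any VOTable library.
class EncodingSizeSketch {

    // Encode each value as a text cell, roughly as a TABLEDATA row would.
    static byte[] asXmlText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append("<TR><TD>").append(v).append("</TD></TR>\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Encode each value as a big-endian 8-byte IEEE 754 double.
    static byte[] asBinary(double[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(8 * values.length);
        for (double v : values) {
            buffer.putDouble(v);
        }
        return buffer.array();
    }

    // Random access into the binary form: value i always starts at byte 8 * i.
    static double readBinaryAt(byte[] data, int index) {
        return ByteBuffer.wrap(data).getDouble(8 * index);
    }

    public static void main(String[] args) {
        double[] ra = new double[1000];
        for (int i = 0; i < ra.length; i++) {
            ra[i] = 180.0 + i * 0.000123456789;
        }
        System.out.println("XML text bytes: " + asXmlText(ra).length);
        System.out.println("Binary bytes:   " + asBinary(ra).length);
        System.out.println("Value 500 read directly: " + readBinaryAt(asBinary(ra), 500));
    }
}
```

The binary form is a fixed 8 bytes per value, so any element can be found by arithmetic; the text form has to be scanned from the start to find the Nth cell.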

A few comments about VOTable:

  • 2 dimensional only. Astronomical data (like a lot of data!) is only superficially 2d because of the limitations of the databases used.

  • The meanings of the values are position-dependent. When using XML, you should be able to take an element and understand what it means. If you look at the parent tag of that element, you'll have more information about the context of that element, and so on up the chain. However, the meaning of a value in a VOTable depends on its position amongst its peers, and this is a Bad Thing. And bug-prone. And not at all robust.

  • Extends poorly - the existing structure makes it hard to add new information - for example, even to add passband information to the header in a useful way!

  • It's an early-XML-days version, designed when XML was not commonly used; so it's not 'good' XML, although it's a very good attempt (see the rather poor XML used with ACE...!). Most things need to be written twice (so 'they' say) and this is a good example.
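
The position-dependence point can be sketched in a few lines of Java using the JDK's DOM parser. The fragment below is a cut-down VOTable-like table (real files have more wrapping elements), and the class and method names are my own invention. To find a 'dec' value you first have to count FIELD elements, then count TD elements; nothing on the TD itself says what it is.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration; the fragment is simplified, not a complete VOTable.
class PositionalLookupSketch {

    static final String VOTABLE_FRAGMENT =
          "<TABLE>"
        + "<FIELD name='ra' unit='deg'/>"
        + "<FIELD name='dec' unit='deg'/>"
        + "<FIELD name='mag'/>"
        + "<DATA><TABLEDATA>"
        + "<TR><TD>10.5</TD><TD>-20.25</TD><TD>12</TD></TR>"
        + "</TABLEDATA></DATA>"
        + "</TABLE>";

    // Returns the cell for the named field in the given row, found purely by position.
    static String cellFor(String xml, String fieldName, int row) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
        // Step 1: the column index is the position of the matching FIELD.
        NodeList fields = doc.getElementsByTagName("FIELD");
        int column = -1;
        for (int i = 0; i < fields.getLength(); i++) {
            if (fieldName.equals(((Element) fields.item(i)).getAttribute("name"))) {
                column = i;
                break;
            }
        }
        if (column < 0) {
            throw new IllegalArgumentException("No FIELD named " + fieldName);
        }
        // Step 2: the value is whichever TD sits at that position in the row.
        Element tr = (Element) doc.getElementsByTagName("TR").item(row);
        return tr.getElementsByTagName("TD").item(column).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("dec = " + cellFor(VOTABLE_FRAGMENT, "dec", 0));
    }
}
```

If a tool writes the TD cells in a different order from the FIELD headers, the lookup still succeeds and quietly returns the wrong number, which is the bug-prone part.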

Binary Formats

BinX

It appears that BinX may be usable soon. However BinX may be too flexible for this particular application - and there may be a lot of work involved in parsing it and writing code that can extract data using it. And it has most certainly not been around a bit - there are likely to be all kinds of teething problems using it for Astrogrid1.

MySQL

We could just pick any old database with a Java API (does MySQL have one?) and use that for Astrogrid1, ensuring that it's part of our distribution.

JDBC

We don't really care about the format of the data, we just want to be able to access it and change it. So we could say that any table that is produced needs to be 'JDBC-compatible' - which implies that the file itself must have some description about what database application to apply to it (possibly just in the extension, as 'standard'). It would mean that accessing any such table (presumably through gridSpace/mySpace) would require the database application present, and moving it between gridSpaces might also require imports/exports to whichever database applications are present at the target gridSpace.

FITS

Already exists, lots of tools and libraries available. Some services return data in this format naturally.

Are there tools/applications that allow us to, for example, apply SQL to FITS data files?

There are no readily-available validators for FITS files (eg, are the column headings valid, are UCDs specified, are certain required keywords present in the header, etc) but it's not a difficult thing to write one - it could even be based on a schema.
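
As a rough idea of how small such a validator could start out, here is a sketch in Java that checks a FITS header for required keywords. It relies only on the basic layout of a FITS header (80-character cards, keyword in the first 8 characters, terminated by an END card). The class and method names are made up, it works on a header already read into a String, and TUCD1 is just an assumed example of a keyword a site might choose to require, not something the FITS standard defines.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a keyword-presence check; not a full FITS validator.
class FitsHeaderCheckSketch {

    static final int CARD_LENGTH = 80;

    // Collect the keywords of a header: one per 80-character card, up to and including END.
    static Set<String> keywords(String header) {
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start + CARD_LENGTH <= header.length(); start += CARD_LENGTH) {
            String keyword = header.substring(start, start + 8).trim();
            if (!keyword.isEmpty()) {
                found.add(keyword);
            }
            if (keyword.equals("END")) {
                break;
            }
        }
        return found;
    }

    // Return the required keywords that do not appear in the header.
    static List<String> missing(String header, String... required) {
        Set<String> present = keywords(header);
        List<String> absent = new ArrayList<>();
        for (String keyword : required) {
            if (!present.contains(keyword)) {
                absent.add(keyword);
            }
        }
        return absent;
    }

    // Pad a card out to 80 characters, for building test headers.
    static String card(String text) {
        return String.format("%-80s", text);
    }

    public static void main(String[] args) {
        String header = card("SIMPLE  =                    T")
                + card("BITPIX  =                   16")
                + card("NAXIS   =                    0")
                + card("END");
        // TUCD1 is an assumed site-specific requirement, used here only as an example.
        System.out.println("Missing: " + missing(header, "SIMPLE", "BITPIX", "NAXIS", "TUCD1"));
    }
}
```

The list of required keywords is passed in, so it could equally be read from a schema-like configuration file.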

Comments on VOTable links to FITS

VOTable has recently been extended to allow pointers to FITS files. I don't think this is a solution:

  • The FITS file itself exists out of context of the VOTable, so:
    • There is no check that the VOTable really does correspond to that particular FITS file (eg FITS file can be replaced with a different one and it will all go horribly wrong)
    • There is no more natural validation than you get from just having the FITS file; you can validate the VOTable but this does not validate the FITS file.

  • VOTable isn't designed to describe other tables; it becomes an empty table with FITS holding the data - so why not just have a FITS file?

  • The whole point (?) of VOTable was a 'standard' easy-to-read format we could build tools for. Adding FITS files means those (XML-based) tools now need to cope with a binary data format.

  • You do get the added information about UCDs in the VOTable. But it's a very longwinded way of going about it; we would be much better off adding this information to the FITS header.

So having the links seems to get the worst of both; requiring FITS tools with the added claggage of XML.

I may be missing the point here (what end? what stick?) as I seem to remember bringing this up ages ago when the link first appeared, but I don't remember any sensible answers. Probably because I didn't ask a sensible question.

Also, the first point is a problem we're going to hit anyway; for example an image server might want to serve some XML descriptive stuff along with the image. So we need to solve that... (checksums?)
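
A minimal sketch of the checksum idea, in Java with the JDK's MessageDigest: the XML description records a digest of the binary file it points at, and a reader recomputes the digest before trusting the pairing. The dataLink element, its attributes and the class name are all hypothetical; SHA-256 is simply my choice of digest here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; the <dataLink> element is invented for illustration.
class ChecksumLinkSketch {

    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Build the descriptive XML that travels alongside the binary file.
    static String describe(String href, byte[] data) {
        return "<dataLink href='" + href + "' sha256='" + sha256Hex(data) + "'/>";
    }

    // True if the bytes in hand are the ones the description was written for.
    static boolean matches(String recordedHex, byte[] data) {
        return recordedHex.equalsIgnoreCase(sha256Hex(data));
    }

    public static void main(String[] args) {
        byte[] original = "pretend these are FITS bytes".getBytes(StandardCharsets.US_ASCII);
        byte[] replaced = "a different file entirely".getBytes(StandardCharsets.US_ASCII);
        String recorded = sha256Hex(original);
        System.out.println(describe("plate42.fits", original));
        System.out.println("original matches: " + matches(recorded, original));
        System.out.println("replaced matches: " + matches(recorded, replaced));
    }
}
```

This only detects that the file has changed; it says nothing about whether the description was correct in the first place.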

Proposal

Random-access format

For random access data, it looks as though we can settle on the data warehouse approach introduced by Kona, Guy, Elizabeth and Clive. That is, if you need to do joins, etc, then load your data up into an RDBMS and use that to do your analysis.

Binary XML

There has been a recent W3C conference to consider a binary form of XML. This is not likely to go anywhere because they are mostly considering messaging rather than the large datasets that the VO will be dealing with. Moore's Law is not sufficient, and we need a solution sooner than the W3C admit they will take to consider it.

It's not that difficult

It may be that BinX can supply this, but from what I have read of the developers' API it is oriented towards representing binary formats as XML rather than the other way around.

Alternatives to VOTable

VOTable -> VOCatalogue.

Let's have a new VO XML format, that does all the things we need it to do now that we know better. We can make it a 'proper' flexible XML file rather than an XML representation of a 2d binary table.

For example, we could describe catalogue data as follows:

<catalogueObject type='star'>
   <magnitude>12</magnitude>
   <ra>blah</ra>
   <dec>blah</dec>
</catalogueObject>

<catalogueObject type='cluster'>
   <flux>12</flux>
   <ra>1.222</ra>
   <dec>blah</dec>
</catalogueObject>

This will be bigger than VOTable, but then if we are using XML at all we care little about size. Tags can include attributes such as ucd and/or unit (and eventually our new universal data descriptors...)

We can then surround the above with, say, a tag that describes the source of the catalogue, passbands, etc.

We can also link from one object to another, eg where the same object is listed from different plates. Or we can link or group objects that are related by the extraction process.
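
To show what the self-describing layout buys us, here is a sketch in Java (JDK DOM parser only) that reads catalogue objects of this shape. The wrapping catalogue element, its source attribute, the numeric values and the class name are all made up for the example. Each value can be described from its own tag, its own attributes and its parent, with no counting of positions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch; the <catalogue> wrapper and sample values are invented.
class SelfDescribingSketch {

    static final String CATALOGUE =
          "<catalogue source='example-survey'>"
        + "<catalogueObject type='star'>"
        + "<magnitude>12</magnitude><ra unit='deg'>10.5</ra><dec unit='deg'>-20.25</dec>"
        + "</catalogueObject>"
        + "<catalogueObject type='cluster'>"
        + "<dec unit='deg'>3.5</dec><flux>12</flux><ra unit='deg'>1.222</ra>"
        + "</catalogueObject>"
        + "</catalogue>";

    // Describe one value using only the element itself and its parent.
    static String describe(Element value) {
        Element parent = (Element) value.getParentNode();
        String unit = value.getAttribute("unit");
        return parent.getAttribute("type") + "." + value.getTagName()
                + " = " + value.getTextContent()
                + (unit.isEmpty() ? "" : " " + unit);
    }

    // Describe every element with the given tag name, in document order.
    static List<String> describeAll(String xml, String tag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse catalogue", e);
        }
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> descriptions = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            descriptions.add(describe((Element) nodes.item(i)));
        }
        return descriptions;
    }

    public static void main(String[] args) {
        for (String line : describeAll(CATALOGUE, "ra")) {
            System.out.println(line);
        }
    }
}
```

Note that the second object lists its children in a different order and has a flux instead of a magnitude; neither matters to the reader.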

Other VOResults

At the moment we've only been really considering stellar catalogue data and images. I'm not sure about other data sets - what about event-based stuff (solar?) or planetary? Thus the renaming to VOCatalogue above (rather than eg VOTree), as I'm sure we can expect other result sets to be added to the VO capabilities.

XML snippets

We could do with settling on some XML snippets that deal with particular 'shapes' of information. For example, passbands might be the standard MAG_B, MAG_K, MAG_R, etc, or they might be a central frequency with a FWHM, or they might be a shape given by a set of points. We will probably want to use passbands in a variety of situations (VOTable/VOTree, ACE config files, etc) and so it would be useful to have a standard XML format, along with a set of Java classes to process them.
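
As a starting point for discussion, the three passband 'shapes' above might map onto Java classes something like the sketch below. Every name here (the Passband interface, the three classes, the passband/centre/fwhm/point tags and their attributes) is an assumption of mine rather than an agreed format, and I have assumed frequencies in Hz.

```java
// Hypothetical sketch of passband classes; tag and class names are not an agreed standard.
interface Passband {
    // Serialise this passband as an XML snippet.
    String toXml();
}

// A standard named band such as MAG_B.
final class NamedPassband implements Passband {
    private final String name;

    NamedPassband(String name) {
        this.name = name;
    }

    public String toXml() {
        return "<passband name='" + name + "'/>";
    }
}

// A central frequency with a full width at half maximum, both assumed to be in Hz.
final class CentredPassband implements Passband {
    private final double centreHz;
    private final double fwhmHz;

    CentredPassband(double centreHz, double fwhmHz) {
        this.centreHz = centreHz;
        this.fwhmHz = fwhmHz;
    }

    public String toXml() {
        return "<passband><centre unit='Hz'>" + centreHz + "</centre>"
                + "<fwhm unit='Hz'>" + fwhmHz + "</fwhm></passband>";
    }
}

// A shape given by a set of (frequency, transmission) points.
final class SampledPassband implements Passband {
    private final double[] frequencyHz;
    private final double[] transmission;

    SampledPassband(double[] frequencyHz, double[] transmission) {
        if (frequencyHz.length != transmission.length) {
            throw new IllegalArgumentException("Need one transmission per frequency");
        }
        this.frequencyHz = frequencyHz.clone();
        this.transmission = transmission.clone();
    }

    public String toXml() {
        StringBuilder sb = new StringBuilder("<passband>");
        for (int i = 0; i < frequencyHz.length; i++) {
            sb.append("<point frequency='").append(frequencyHz[i])
              .append("' transmission='").append(transmission[i]).append("'/>");
        }
        return sb.append("</passband>").toString();
    }
}

class PassbandSketch {
    public static void main(String[] args) {
        Passband[] examples = {
            new NamedPassband("MAG_B"),
            new CentredPassband(100.0, 10.0),
            new SampledPassband(new double[] {1.0, 2.0}, new double[] {0.5, 1.0})
        };
        for (Passband p : examples) {
            System.out.println(p.toXml());
        }
    }
}
```

Anything that needs a passband can then take a Passband and not care which of the three shapes it was given.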

-- MartinHill - 26 Jun 2003

Topic revision: r3 - 2005-09-01 - 11:11:48 - MartinHill
 