Registry Query Language

Outline of the Problem

I am writing this wiki page because there seems to be an unintentional overlap of functionality (bad) or maybe a convergence (good) between the VOQL and the Registry working groups of the IVOA, as became apparent at the Interop meeting in Cambridge (Mass) in 2004 May. Comment on this note should be sent to the registry mailing list of the IVOA.

Just to summarise the potential problem:

Resolution 5 of the second Registry session says:

    An IVOA Registry must support a Full Query (interface tbd) using the WHERE 
    clause of the ADQL schema with fields specified using non-namespaced XPath 
    directives (simple, ie no [...] options), and returning full resource entries.

Meanwhile the SkyNode interface version 0.7.4 section 5.2.1 (basic sky node) has this text in sub-section QI-8:

    SkyNodes should accept the VOQL equivalent of Standard SQL-92 Metadata queries.  
    These queries include (expressed here as SQL but that would be ADQL/XML coming 
    to the node):
        SELECT * FROM TABLES
        SELECT * FROM COLUMNS
    We need to decide how much of the entire set is relevant and required.  Since 
    most DBs handle this we could just say all of it.  The syntax may be a little 
    different in each database, i.e. in SQL-Server this is:
        SELECT * FROM information_schema.tables
    While ORACLE does not support Information Schema (as far as the author can tell).

The conclusion seems clear: the general type of query will be of the form

   SELECT results   FROM source   WHERE conditions ;

I see that the various registry teams have made good progress in that prototype registries are just starting to come on stream, but a number of practical issues come to mind connected with populating and querying the registry, to which I have so far seen no answers and little discussion. There also seems to be a shortage of use-cases for the Registry, which I find troubling. Most of the Science Cases which we produced earlier are not much help as they were written on the assumption that the users were relatively expert and knew what resources to use in their invesitgations.

There is also the current divergence of opinion between the AstroGrid and US-NVO teams on the level or granuality of the metadata to be stored in the Registry. Perhaps some of the problems arising from this might be resolved by considering the practical issues.

An Example of a Tabular Query

I would like to bring the discussion down to earth (if that isn't the wrong metaphor in astronomy) by giving an example of a query that the "VO system" ought to be able to handle. This might arise from a search for high proper-motion objects (e.g. nearby stars) in a small region of sky. This may not be a very typical VO query (is there one?) but is a simplified version of a reasonably realistic case that the VO should be able to handle. Here is a simple example I've given before:

   SELECT * FROM sometable WHERE REGION('CIRCLE J2000 12.3 45.6 0.1') AND
      proper_motion > 100 ;

Finding the Table Name

The first thing to work out is how the user finds "sometable". In some cases the user will have the expertise to know already, in which case this section becomes irrelevant. But we want the VO to handle queries from non-experts, or even the expert whose knowledge is incomplete.

In general the user will send a query to the Registry to find resources which probably (but not certainly) have the desired information. In this example the user wants tables with proper motions and with coverage of the specified part of sky. The syntax obviously needs to be thought about more carefully, but here's a stab at what might be formulated, in a the spirit of ADQL/sql:

   SELECT table_name 
      FROM vo 
      WHERE Type = 'Catalog' AND
            Coverage.spatial = circle('FK5', 12.3, 45.6, 0.1)  AND
            UCD(pos.pm.ra, pos.pm.dec) ;
Notes:
  1. It is not clear what goes after the SELECT keyword - obviously you want table names, what else could go here?
  2. I have set the argument of the FROM as just 'vo' to make this query as general as possible. Again what else might one put here? Or is this where one specifies a particular Registry, rather than leaving the system to choose one?
  3. The syntax of the "Coverage.spatial" element is relatively obvious from the Resource Metadata document. At present the syntax of a similar query in VOQL has a single string argument with space-separated items, which means that one keyword (REGION) covers all shapes, but as the number of arguments varies they are wrapped in a string. We need to adopt one syntax or the other for both types of query, it seems to me.
  4. I have assumed that UCDs version 1+ will be used, and that there will be a function allowing the user to list all the required UCDs. Perhaps some flexibility is needed here, e.g. to allow alternatives, or saying you want to see tables having 4 out of 6 of the following UCDs?
  5. There is already provision in the Registry spec for the storage of a list of UCDs so that queries of this type can be handled purely by the Registry. It is not clear whether the Registry should know the mapping from UCDs to actual column names: more on this later.

Another possibility is that the user might know that, say, Vizier held the required table(s) and just want to query that. Vizier already has this type of search facility, but using a conventional web form, as do a few other data centres with large holdings. They may well want to implement an interface which is VO-compliant by allowing queries of this form. I suppose the query could then be FROM Vizier rather than FROM vo ?

Finding the Column Names

An important question is whether queries to tabular datasets can use UCDs or whether column names are required. Since column names are not at all standardised, a distributed query just has to be reliant on UCDs. But UCDs are not completely unique in any given table, just fairly unique. (Purists and pedants will say you can't say 'partly unique' any more than you can say 'partly pregnant', but I will ignore them). The designers of VOQL have tried to fudge the issue: normally column names will be used, but UCDs are a possible alternative.

If column names are required, I guess the thing to do is for the system (or the user explicitly) to issue a query asking for all the column names in a table, with other details. Perhaps like this:

   SELECT name, data_type, units, ucd 
     FROM  this_table ;
As Wil points out in the SkyNode specification, SQL 1992 seems to standardize this sort of query, so it is reasonable to expect data centres using a modern DBMS to support it without problems, though no doubt like everything else in the SQL world, theory and practice will diverge somewhat.

If queries can depend upon UCDs then there has to be some way of storing them and mapping to actual columns.

Finding Physical Units

It is fairly obvious that a query can only be formulated like this if you already know the units used for the column called "proper_motion" so that you can make sure that your constant (100) has the same units. Note that:

  1. The REGION section works only because we know that the arguments are in degrees and not, say, radians or hours.
  2. Similar problems will arise when using a WHERE clause involving just about any numerical quantity such as flux, size, exposure time, etc. A few quantities such as red-shift are dimensionless, but most are not.
I have chosen proper-motion because it is a case that really needs to be solved, and because units for it are not at all standardised. A quick check of Vizier shows that it contains 231 tables with columns of proper motions, and that they use 19 different physical units. The International Astronomical Union has pronounced on this matter and strongly recommended astronomers to use SI units. Proper motions are just angular velocities, so the SI unit is radians/second: none of the 231 tables uses radians/second.

This particular problem could be solved merely by devising some method to make the user aware of the units used by each table, which could be discovered by using type type of query given in the preceding section. This does not work, however, if we want to devise a system capable of handling distributed queries: if the same query has to be sent to different tables perhaps resident at different sites, we need some way of specifying the units in the query. In principle we could lay down the law that all systems and all queries should use SI units (as approved by the IAU) but the Vizier sampling showing a vote of 0/231 in favour of SI units, suggests that this would not go down too well in practice. What I propsed to the VOQL working group, and what seemed to be generally accepted (though the details are not at all fixed) is that the query language should allow and indeed encourage the use of units associated with all numerical parameters. So the previous example just becomes:

   SELECT * FROM sometable WHERE REGION('CIRCLE J2000 12.3 45.6 0.1') AND
      proper_motion > 100 [mas/y];

Assuming that the units are milli-arcseconds per year. Various systems for specifying units have been devised by astronomers. The most complete yet succinct syntax is, in my opinion, that devised by François Ochsenbein and colleagues at CDS in the document http://vizier.u-strasbg.fr/doc/catstd.htx section 3.2.

Many of you will have already spotted that this query has been simplified too much. For one thing, nearly all catalogues give two separate components of proper motion along the RA and DEC axes. So the query really ought to be something like this:

    SELECT * FROM sometable WHERE REGION('CIRCLE J2000 12.3 45.6 0.1') AND
      SQRT(pm_ra * pm_ra + pm_dec * pm_dec) > 100 [mas/y];
or perhaps (since VOQL following JDBC has no exponentiation operator, but a POWER function:
    SELECT * FROM sometable WHERE REGION('CIRCLE J2000 12.3 45.6 0.1') AND
      SQRT(POWER(pm_ra,2) + POWER(pm_dec,2)) > 100 [mas/y];
This will take a bit more effort on the part of the query parser, as it has to know how the POWER and SQRT functions affect the units. A number of systems have solved similar problems. Tara Murphy of CSIRO told me about a website that she has set up (still under development) which handles a range of units conversions.

Actually Google™ doesn't do badly here either: try typing "speed of light in furlongs/fortnight" in the search box and it gives you the answer (1.8026175 × 1012 furlongs / fortnight) right away.

It is worth discussing where the units information should go, and indeed where it will come from. No commercial (or indeed freeware) DBMS supports storage of physical units, and nor (more suprisingly) do the astronomer-written DBMS such as BROWSE and WCStools. This will need a new data structure, to be stored somewhere.

Populating the Metadata Tables

What is clear is that to make a fully-functioning VO query system we need to consider storing several attributes for each column in each table:

Attribute Comment
Table name obviously
Column name ditto
Column data type desirable to allow query syntax to be checked
Column physical units needed for most WHERE clauses involving numerical quantities
Column UCD needed to allow distributed queries, and to discover relevant tables
Column display format Provided in many FITS tables: desirable to format numbers legibly
Column range of values Helps users formulate sensible queries
Column nullability Helps users realise when null values may exist
Column description Text which may explain details of value
Column indexed? May be useful in formulating efficient queries

The volume of information involved here is considerable. The worst case, we all know, if Vizier with over 4000 tables and maybe 100,000 columns. But even data centres of modest size have many tabular datasets. For example LEDAS contains around 400 tables, with around 7000 columns, and I suspect many others around the world are comparable in size. In the case of Vizier all the essential information (units, UCDs) are already stored in the system; in most other data centres they are not. At present the user can discover the physical units of columns only with considerable effort - a few systems provide the information readily, in others you have to read the small print of the documentation, if you can find the relevant documentation. Apart from Vizier, hardly any systems provide UCDs systematically. Even with Vizier you have to tick a small check-box and redisplay the query form to get the UCDs shown.

The discussion of how to populate the metadata tables has been rather brief on the mailing list, as far as I can tell, with no mechanism proposed for the acquisition of the information besides some web-forms which data centre managers can fill in. It is quite obvious that this will not work for the quantities of metadata actually involved. I think what we need is a defined data structure for columnar metadata, and some tools to help populate these tables.

Another interesting question is whether there should be a one-to-one correspondence between data tables and metadata tables, or whether for each site (data centre or Registry) one very long table should contain all the columnar metadata for all relevant tables. The latter makes searching faster, since it can be stored in an RDBMS and indices used where appropriate.

Checking Query Syntax

One argument in favour of storing the column metadata in the Registry rather than at the data centres is that this makes checking the queries easier and more efficient: you can make sure that the user simply cannot generate a query which is other than correctly formed, with column names and UCDs which make sense. I am not sure how important it is to do this: most DBMS seem to return error messages pretty smartly if you commit an obvious faux pas such as spelling a column name wrong. I suspect that the advocates of prior syntax checking have in mind the case of complex workflows: if you set up a set of umpteen steps and the 13th step fails because of a spelling mistake, this could be annoying. The counter-argument is that too much checking will cause excessive delay. My own inclination would be to leave checking out for the moment, and just try to get something that works. As SkyNode and other systems allowing almost raw SQL input have shown, users manage to write their queries without making too many mistakes, or else they learn fast.

-- ClivePage - 22 Jun 2004

Topic revision: r1 - 2004-06-22 - 16:10:00 - ClivePage
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback