Distributed Cross-Matching - a discussion document
Introduction
One of the more elusive goals of the Virtual Observatory has been to find a way of cross-matching sources in different source catalogues even when they are located at different places. The principal (though by no means only) criterion for cross-matching is the approximate coincidence of celestial coordinates, which in database terms is a spatial join. We now understand pretty well how to do this efficeintly when the two catalogues are ingested into the same DBMS, provided they support spatial joins and spatial indices. Not all DBMS do so, but many of do, including one Open Source product, PostgreSQL. There is also a way of performing a spatial join using a DBMS which only supports simple B-trees, but it is much more complicated (because one has to cope with positions which cross pixel boundaries) and cannot really be recommended. The messy details are described in
SkyIndexing and
SpatialIndexing.
One way of performing the
distributed join, therefore, is simply to copy one catalogue to the site holding the other, ingesting it into the DBMS, creating any necessary spatial index, and doing the spatial join in situ. There are, however, several reasons why this is not the ideal solution to the problem:
- Sometimes even the smaller table can be quite large, up to tens of gigabytes in size, so the operations of transporting it and ingesting it into another DBMS are time-consuming and error-prone.
- Possibly neither table is stored within a DBMS supporting spatial joins.
- It is important to have the master copy of each large dataset properly curated, so that any necessary updates can be applied. Every time a complete copy is taken there is the potential for errors to creep in, or for it to get out of date.
- The production of large source catalogues takes a lot of effort, and while Data Centres are generally willing to allow users to search the resulting catalogue, they may not be willing to provide complete copies, because of the effort involved, and because they fear the dilution of the value of their intellectual property.
The SkyNode System of JHU
If the cross-matching operation only involves a relatively small area of the sky, the data volume is modest, and the inefficiencies of a join over the network might be tolerable. Having realised this, the SDSS Team at Johns Hopkins University built the
SkyQuery system. This is, as far as I am aware, the only system which supports a distributed spatial join of astronomical tables. SkyQuery is described in a number of documents on that web-site. The first three sky nodes are:
- SDSS-EDR: Sloand Digital Sky Survey - Early Data Release, located at Fermi Lab in Chicago.
- 2MASS: a section of the 2-Micron All-Sky Survey located at Cal Tech in Pasadena.
- FIRST: a local copy of the VLA Faint Images of the Radio Sky at Twenty centimeters - located in Baltimore.
The SkyQuery system uses Web Services interfaces (SOAP and WSDL) but takes advantage of the fact that the same DBMS (Microsoft SQL Server) is installed at all SkyNode sites. The scripting makes use of Microsoft's .NET infrastructure and C# language.
It is interesting to examine the system in some detail, to see if it can be extended to other platforms and DBMS. The document
SkyQuery: A Distributed Web-based Query Service for Astronomy by Malik, Budavari, Szalay and Thakar shows a couple of simple examples of queries, and explains how they are handled. The first one is this (which I have reformatted with numbered lines for clarity):
SELECT p.g, t.j_m, p.type -- 1
FROM SDSS:photoprimary p, TWOMASS:photoprimary t -- 2
WHERE XMATCH(p,t) < 3 -- 3
AND AREA(175, 0.5, 5) -- 4
AND p.type = 3 -- 5
AND (t.j_m - p.g) < 0.2 -- 6
AND t.j_m < 20 -- 7
AND p.g < 21 ; -- 8
- Line 1 specifies the columns of the output table.
- Line 2 names the tables to be searched: since there are two of them, this implies a join of some sort.
- Line 3 gives the cross-match criterion: the XMATCH function can take any number of table names (in fact aliases) as arguments, and the the "< 3" specifies a 3-sigma match criterion, implying a probability level if the error distribution is known.
- Line 4 specifies the area of sky to use:
AREA(175, 0.5, 5) means an area centred on RA=175.0 degrees, DEC=0.5 degrees, and of radius 5 arc-minutes (or possibly arc-seconds according to one document).
- Lines 5 to 8 specify further filtering to be applied to the results, not that some apply to one input table, some to the other, and one (in line 6) involves columns from both tables.
The
AREA specification is absolutely vital in these queries, as it reduces the number of rows to search in the largest table (2MASS) from about 500 million to (in this example) under a hundred. The system was designed for use on smallish areas of sky and is at present not suitable for use on very large areas of sky.
The query is analysed and the first step is to determine the maximum number of rows that can possibly be involved in the match at each site. It does this by sending to the 2MASS site a query like this:
SELECT COUNT(*) FROM TWOMASS:photoprimary t
WHERE AREA(175, 0.5, 5) AND t.j_m < 20 ;
And sending to the SDSS site a query like this:
SELECT COUNT(*) FROM SDSS:photoprimary p
WHERE AREA(175, 0.5, 5) AND p.type = 3 AND p.g < 21 ;
If the former returns the smaller number of rows, then it is more efficient to transfer TWOMASS data to the SDSS site, than vice-versa. Once this is decided, the next steps are:
- Run a suitable query on TWOMASS to extract the required subset of rows and columns
- Copy the results across the network
- Ingest them into a temporary table on the SDSS database.
- Execute the join between the two subset tables.
The authors note that because the
COUNT(*) queries have just been executed, the required rows are likely to be still present in the disc cache, so the subsequent subset extraction and join runs rapidly. Note that the SkyNode system runs on a DBMS without a spatial indexing system, so that HTM pixel-codes are used to implement the AREA function, but these are perfectly adequate for simple cone-searches, which is all that is required here.
Spatial Joins with Postgres
I think a similar system could be set up within Astrogrid without too much difficulty. We have already been testing spatial joins using Postgres. One of the add-on packages is called
dblink and it allows one server to access data located on another. In principle the data could be extracted from some other brand of DBMS: there are some signs of work on an emerging standard called SQL/MED, which stands for Management of External Data, but is still work in progress, and at present Postgres can only access data from other Postgres servers.
I have experimented with dblink between servers in Leicester and Cambridge. The system works as expected, but joins over the network are rather slow. With both tables resident on the same server, one can perform a spatial join of the whole of the 1XMM catalogue with the Northern section of 2MASS in about 4 seconds. Doing the same join with 1XMM data resident at Cambridge and 2MASS in Leicester takes 161 seconds. This is probably still less time than it would take to perform an explicit dump of the 1XMM table at Cambridge, move it across the network to Leicester, ingest it into the DBMS, and perform the spatial join in situ, but perhaps not a great deal less. We could obviously do a similar operation using XML data formats and SOAP messages, but the overheads this imposes would certainly slow the operation down still further.
If all sites were to adopt Postgres, a distributed cross-matching facility could be provided without much difficulty, but this might not be politically acceptable.
Generic Spatial Joins
If we want to produce a cross-matching solution which is not DBMS-specific and is based on Web Services, then a number of problems need to be tackled:
- We need a query analyser somewhat like that of JHU which produces COUNT(*) queries to estimate the data volume involved in each subset.
- We need a standard format for the data extracted from one DBMS which is to be ingested into another. The current de-facto standard for this is CSV (character-separated value), but the details differ from one DBMS to another, especially in the representation of nulls. We could use VOTable as the transfer format as we already have ways of producing such files from some DBMS, and the upload of VOTable files is obviously needed for other purposes as well. The SQL INSERT command is fairly standard but slow; most systems have a much faster way of loading bulk data, but the commands are not at all standard.
- The DBMS needs to give permission for the creation of a new temporary table, and for the ingestion of new data.
Query Language Issues
The AREA function in the SkyNode examples has changed slightly into the REGION function of ADQL - the main change being the addition of shape and epoch parameters.
The XMATCH function in SkyNode queries may also need to be modified. At present it has the appearance of a function which returns a sigma level, which can be compared with some limit (< 3 in the example above). This is obviously something that is specially interpreted by the query analyser. It seems to me that a function should include the match threshold as an additional argument, maybe the first. So that one would write something like
XMATCH(3, p, t) which would be a Boolean function.
Although astronomers are often happy to work in terms of sigma levels, where sigma represents a standard error, this seems unsatisfactory as a measure of goodness of positional match. It would be better, I feel, to use some measure of probability or likelihood.
Queries such as
t.j_m < 20 AND p.g < 21
are only feasible if the user knows the units of the
j_m and
g columns. There they are both magnitudes, which are more-or-less unitfree, but one can easily envisage similar filtering functions using fluxes which might be in milli or micro-Jansky, or proper motions, where the units are vital. This problem has been raised elsewhere; solutions are likely to involve one of the following:
- All queries are posed in standard units (e.g. all proper motions in radians/second?)
- The user is provided with an easy way of finding the units used in each table, and has the responsibility of specifying the constants accoudingly.
- The user is required to post units with any quantity, e.g.
AND proper_motion > 1.234e-13 [radians/second] and the system, accessing the relevant metadata, transforms them to that required for each table.
Unfortunately we may have to deal with queries where values are compared from tables using different units, e.g. the clause from the previous example which said:
AND (t.j_m - p.g) < 0.2
This is no problem if magnitudes are involved, but suppose the we wanted to combine radio and X-ray flux densities, and they used different units? Solutions (1) and (3) above will work (I think) but (2) will not.
Comments
I have created a
Forum topic for comments or feedback.
ADQL
ADQL is the Astronomical Data Query Language, a draft standard of the IVOA. This is an SQL-like query language which may be translated into an XML document for passing to astronomical data archive sites. It is described in a document of only 8 pages
IVOA Sky Node Interface. This document is short because the gory details are specified only in an
XML document linked from it. Since I think it desirable to have a human-readable description of the main elements of ADQL syntax, here is my attempt at it. The basic syntax is this:
SELECT listOfColumns
FROM listOfTables
WHERE selectionExpression
GROUP BY groupExpression
HAVING havingCondition
ORDER BY orderCondition
The syntax is based on that of SQL, but with some limitations. The listOfTables for example, must contain a table alias for each named table, and these tables must be used in all column expressions to properly quality each column name. For example:
SELECT x.ra, x.decl FROM rosat AS x WHERE REGION('CIRCLE 2000 123.45 -12.34 0.01');
Some additional material can be found in the wiki page
ThoughtsOnXMATCH
--
ClivePage - 11 May 2004