Cross-matching Catalogues: Proposed Implementation

Clive Page
2005 January 6

1. Introduction

The federation of datasets has been high on the list of priorities for the Virtual Observatory projects but only limited progress has been made so far.  The scientific importance hardly needs to be stressed: astronomers need to combine information from different wavebands to get an understanding of the spectral properties of objects, and similarly to combine information taken at different epochs to see whether sources are variable.  The Aladin service of CDS provides good facilities for the federation of images, but facilities for federating catalogues are still comparatively primitive.

The term federating means, I guess, that we want to be able to use information from two datasets without necessarily having them present on the same computer.  This is actually something of a tall order, not to say an obstacle to progress.  The mantra "ship the results not the data" is fair enough when there all the data you need to use are resident on a single remote server, but if you want to perform some process on two large datasets, resident on different remote systems, this is of little help.  What you want to do is to transfer the minimum amount of information from one site to another, since bandwidth is usually the scarcest resource in large-scale data handling.

Ideally what is needed is a catalogue cross-matching facility which can handle:
I don't think we can satisfy all these requirements at once.  In particular any project which involves the cross-matching of one large catalogue with another is bound to be a significant undertaking, requiring astronomical and database expertise.  It would be better to concentrate initially on providing a system which can cross-match relatively small catalogues, say up to a million sources (and which may arise from the user's own researches, or be a subset extracted from some standard catalogue) with standard catalogues which come in all shapes and sizes.

1.1 Cross-matching criteria

The principal cross-matching criterion is always coincidence of celestial position.  Of course an exact match is unlikely because there are many source of error in the apparent celestial coordinates of sources: what is needed is a match within the overlap of error-regions.  Existing catalogues seem to come with positional error-regions of four main types:
  1. All sources in the catalogue are assumed to be measured with the same accuracy, so we can assume a circular error region of constant radius (which one hopes to discover by reading the documentation associated with that catalogue).
  2. Sources have individual positional errors typically depending on their flux or magnitude, but these are equal along the coordinate axes, so there is just a single column of error-radius.
  3. The errors along the RA and declination directions are different, so there are two columns, of RA-error and DEC-error, specifying an error-ellipse.
  4. The error region is elliptical but the axis is inclined to the coordinate axes, so there are columns of major-axis-error, minor-axis-error, and position angle.
More complex error regions are found in a very small number of cases, but the concept of error-ellipse seems general enough for the present purposes.  The most serious additional complication is that the units of the errors vary widely, from degrees, radians, to arc-minutes and arc-seconds, or even sub-multiples of them.  This all has to be understood and handled by the astronomer ingesting a new catalogue to the system.  In due course when standards for UCDs and units are fully implemented, it should be possible to handle such variations automatically.

Ideally each source in the first catalogue has exactly one counterpart in the second.  In practice many sources have no apparent counterpart, perhaps because they are too faint to have been detected.  This in itself can be valuable information, so that the user should be given the option to record such non-matches: in database terms this corresponds to a LEFT OUTER JOIN.  The other case, not infrequently encountered, is that there are two or more counterparts found on the basis of the overlap of error-regions: here it is often possible to use other properties of the sources to work out which of them is the most likely counterpart.  Obviously considerable astronomical expertise is needed at this stage. 

The important point is that we want the database system to whittle down the number of possible counterparts from the millions or billions to at most a handful, after which the data volume is small enough that the astronomer can easily apply the necessary scientific expertise to find the single best match.  The latter stage is beyond the scope of the present document.

1.2 Filter and Refine Stages


The computation of the great-circle distance between two points on the celestial sphere involves using several trig functions, which are computationally expensive.  The majority of cross-matching algorithms therefore proceed in two stages
  1. The filter stage finds a small number of possible matches on the basis of approximate positional match, for example when the minimum bounding rectangles (MBR) of the error-regions of the sources overlap.
  2. The refine stage carries out the spherical trigonometry to measure actual offsets to weed out those sources which do not actually have overlapping error ellipses.  If the astronomer wishes, this refinement stage may also compare other properties of the sources such as magnitude or spectral type, to reduce the number of cases in which more than one source appear to match.

2. Possible Algorithms

2.1 Use a Spatial Index

If you need to process spatial data (such as spherical-polar coordinates) then it seems to me to be obvious that you should use a DBMS which supports spatial objects and provides a spatial index.  Many products provide such facilities and have them fully integrated into the product, for example DB2, Informix, MySQL, and Postgres.  All of these except DB2 have spatial indexing based on R-trees, which is simple and requires no configuration.  DB2 uses a multi-level grid file, which requires conversion of the coordinates to integers with a specified range.  My evaluation of spatial indexing in MySQL and Postgres was reported in http://wiki.astrogrid.org/pub/Astrogrid/DataDocs/cross2.htm and showed that it worked well in practice, was reasonably easy to set up, and efficient.  A few other DBMS have spatial indexing as an optional extra, often using code from a third-party.  Unfortunately there are also a few DBMS where there seems to be no spatial support options, and this includes Microsoft SQL Server. 

To provide a cross-matching facility at a single site it is merely necessary to use one of the DBMS with support for spatial indexing, and use the facilities provided to do a spatial join.  It would be very easy to set up a user facility of this type at Leicester, where a number of tables have been ingested into Postgres.  Providing a portable solution would be rather harder, especially as the syntax used to create and use spatial indices varies so much - see Appendix A for some of the gory details.

Unfortunately several astronomical data archive sites use DBMS which do not have spatial indexing, so the options would seem to be

2.2 Sort/sweep Algorithm

An alternative spatial join algorithm was invented (or discovered) by Dave Abel, Drew Devereux, Robert Power, and Peter Lamb of CSIOR as reported in An O(NlogM) Algorithm for Catalogue Matching.  Details of the performance are given in another technical report, Benchmarking Catalogue Cross Matching.  It is not easy to make exact comparisons with the R-tree indexing benchmarks which I obtained, although many of the same datasets were used, because the platforms were very different.  There can be no doubt, however, that the sort/sweep algorithm is about an order of magnitude faster when one large table is cross-matched with another of comparable size, but when a small catalogue is cross-matched with a large one, the R-tree method is up to an order of magnitude faster.

This result is not too surprising: the sort/sweep algorithm reads the whole of both tables sequentially, whereas a spatial join using a DBMS seems to use a nested loop join: the smaller table is read sequentially, and for each of its rows, the matching rows in the larger table are found using the R-tree index.  An indexed lookup takes two or three disc seeks each taking around 10 ms, so if there are only a few thousands of these, this does not take much time.  But if there are hundreds of millions of them, this is inefficient.  Reading a disc sequentially is much faster, since sequential read rates of around 50 MB/sec are easy to achieve, i.e. one disc block per 0.01 ms.  The automatic disc caching which modern operating systems provide reduce this ratio somewhat, but a random disc read can be as much as 1000 times as expensive as a sequential one.  Providing one is only accessing less than 0.1% of the second catalogue, therefore, random access is probably faster, but if there are so many rows in the first catalogue that there are substantially more look-ups than that then the sequential read method would be expected to be faster, which is what we observe in practice.

Another snag with the sort/sweep algorithm is that it cannot be performed within a relational DBMS, only with specially written code (written in C++ by the CSIRO team).  It is therefore necessary first to export all the data from the DBMS, run the cross-match code, and then it will normally be necessary to ingest the results into the DBMS.  For most products, data export and ingest have to use plain text files in character-separate value format, which involves a lot of I/O overheads.

2.3 Pixel-code Algorithm


Many people have had the idea that the problem could be solved easily if the celestial sphere were to be covered with a grid of pixels, so that each point (RA,dec) maps to an integer pixel number.  This does indeed make it fairly easy to implement a simple cone-search, as this turns into a search for a list of integer pixel numbers.  Such a search can be made quite efficient by creating a B-tree index on the column of integer pixel-codes,  something that is feasible with every relational DBMS.  A number of different grids have been proposed: the simplest is the a grid of pseudo-rectangles drawn along the lines of constant RA and declination, usually known as the igloo or disco-ball grid.  This simple grid may be good enough, but depending on the exact design, it may suffer from the convergence of many grid lines at the poles.  A couple of more complex pixellations which overcome these problems are HTM, which uses triangular pixels or trixels, and HEALPix which has square ones of equal area.  Further details of these are given in http://wiki.astrogrid.org/bin/view/Astrogrid/SkyIndexing and references therein.  The diagrams below will, for simplicity, assume rectangular pixels, but the same arguments apply to other shapes.



The problem is, of course, that we are not dealing with points, but with small error-regions, which are typically ellipses or circles.  However the grid is drawn, sometimes an error ellipse will cross a pixel boundary.  It is sensible to choose a grid granularity which makes the pixels all larger than the major-axis of the largest error ellipse, so that the worst case is that an ellipse can cross four pixels (HEALPix) or six trixels (HTM).  The worst case for the igloo mapping may be higher if too many grid lines converge on the poles.  It might be thought that we could ignore these awkward cases, but as noted in http://wiki.astrogrid.org/bin/view/Astrogrid/PixelCodeSizes using a grid resolution chosen to maximise efficiency gives you pixels of about 25 arc-seconds across, and even if the largest error ellipses are no more than one arc-second across, between 12.5 and 15.5% of error ellipses will cross a pixel boundary, so if just the central position is mapped to a pixel-code, there can be a substantial loss of potentially valid cross-matches.  Naturally the loss gets much higher if the typical error ellipses are a larger fraction of the pixel size.

The naive solution is to add to the catalogues a column which can contain a list of pixel-codes.  For example, the database tables corresponding to the diagram above might look like this for cata:

ida
ra
decl
mag
pixelcodes
1
123.45
23.45
14.1
123, 124
33
123.47
23.50
15.67
124, 127

And catb:

idb
ra
decl
flux
pixelcodes
42
123.49
23.49
2.4e-3
124, 127
99
123.46
23.47
1.2e-3
124

Naively we might expect to be able to do a database join on these tables.  Unfortunately the rules of relational databases do not allow multiple values in a field, as we have for the pixelcodes column.  Fortunately there are well-established ways of handling problems like these where there is a many-to-many relationship: the methods of entity relationship modeling can be used to derive table structures which are all properly normalised according to the rules of Codd and Date.   After applying these rules, we get the following.  The tables cata and catb have their pixelcodes column removed, but two additional tables have to be created to map the the unique source identification with the pixel-code: when a source error-ellipse covers two (or more) pixels then there are simply two (or more) rows in these additional tables, which I shall call pixa and pixb

Table pixa
ida
pcode
1
123
1
124
33
124
33
127

Table pixb
idb
pcode
42
124
42
127
99
124

These two auxiliary tables will have somewhat more rows than the corresponding catalogues, but only two columns, so they do not require a huge amount of extra disc space.  The methods of entity relationship modeling tell us how to get the cross-matching result that we need. 

The next step is to join pixa and pixb on their common column of pixelcode, which can be done efficiently if a B-tree index is created on the pixelcode column of at least one of them.  The command to do this in the Postgres dialect of SQL looks like this (other dialects have slightly different syntax):
SELECT pcode, ida, idb INTO TABLE joinab FROM pixa AS a, pixb AS b WHERE a.pcode = b.pcode;

The result of the join is an new table, which we called joinab, which looks like this:

pcode
ida
idb
124
1
42
124
1
99
124
33
42
127
33
42
127
33
99

Actually the pcode column is not really needed, but I include it here just to show where the each row comes from.  Clearly the resulting table includes some duplication, because both sources a.33 and b.42 are present in pixels 124 and 127.  The way to remove duplication is to use the SQL keyword DISTINCT in the selection, like this:

SELECT DISTINCT pcode, ida, idb INTO TABLE joinab FROM pixa AS a, pixb AS b WHERE a.pcode = b.pcode;
Then the resulting table has no rows duplicated exactly:

pcode
ida
idb
124
1
42
124
1
99
127
33
42
127
33
99


The final stage is to use this joinab table in a three-way join with the original catalogues,
SELECT ida, a.ra AS raa, a.decl AS decla, mag, idb, b.ra AS rab, b.decl AS declb, flux
FROM cata AS a, joinab AS j, catb AS b
WHERE a.ida = j.ida AND j.idb = b.idb
AND greatCircleDistance(a.ra, a.decl, b.ra, b.decl) < 3;

This assumes that we have available to us a function to compute the great-circle distance between two points on the celestial sphere, and for simplicity that we want to accept matches only if the distance between the points is less than some fixed value, here 3.   In many cases, of course, the distance threshold will be some function of the individual error-ellipse parameters, but this can be handled with just a bit more complicated trigonometry.  Most DBMS allow user-defined functions like this to be written in a procedural language, such as C, Java, or pseudo-SQL.  This produces the result we wanted, thus:

ida
raa
decla
mag
idb
rab
declb
flux
1
123.45
23.45
14.1
99
123.46
23.47
1.2e-3
33
123.47
23.5
15.67
42
123.49
23.49
2.4e-3


Discussion

This algorithm may seem a bit complicated, but I have shown that it works in practice.  But it does have some problems in practice, especially for the case in which the smaller table, cata, is imported by the user, or created by user e.g. by taking a subset of some standard catalogue.  The main problems are these:
  1. After the user's catalogue is ingested into the database, it is necessary to create the pixa table, which may have several rows for each row in the original cata table.  Generating this within a DBMS is not easy: a user-defined function is normally only allowed to return a single scalar value, whereas what we need here is an array-valued function.  I think the only practical solution is to have code external to the DBMS which creates the pixel table, and then ingests it into the DBMS.
  2. The next stage depends on having a B-tree index created on both cata and pixa tables.  The CREATE INDEX operation is not one that is supported by JDBC, so this also needs special code support, which will be hard to make portable.
  3. The SELECT DISTINCT operation required when generating the join table implies sorting the results: for a large table this can be slow, and it always requires at least N log(N) operations for N rows, so its scaling is worse than linear.
The first two factors do not really apply in the case of the Astronomical Data Warehouse, where all tables can be set up with all the necessary auxiliary pixel tables and indices by the database administrator, but in this environment the cross-matching can be done more easily and efficiently using a true spatial index.

2.4 Modified Pixel-code Method

I have recently devised a modification of the preceding method which I think overcomes most of its difficulties.  This is based on the observations that, firstly, the cross-matching is carried out in two stages, and that the selectivity of the initial filter stage is not crucial to its success, and secondly, that the refinement depends upon selecting sources within the combined error-radii, but it does not matter much whether the necessary distance is associated with one source or the other.  The steps necessary are these:

After the new table, taba, is imported or created by the user, it is necessary to append a new column of pixel-code values to it, but for each row this just contains the single pixel-code (HTM, HEALPix or whatever) corresponding to the central source position.  The HTM algorithm is recursive, but the HEALPix algorithm is simple enough that I managed to encode it in a user-defined function in the Postgres procedural language, pl/pgsql.  It is likely that a similar user-defined function could be created for all other DBMS, which would allow the pcode column to be created and populated using just SQL statements, which could be carried out via the JDBC interface.  The resulting table would look like this:


ida
ra
decl
mag
pcode
1
123.45
23.45
14.1
123
33
123.47
23.50
15.67
124


It is also necessary to estimate the largest error-radius (or for ellipses the largest semi-major axis value) of any source likely to be cross-matched with the standard catalogues, Rmax .  This is likely to be just a few arc-seconds for practical cases.  This radius then needs to be added to the radius (or radii) of the error-region for each source in the catalogue catb.  The effect of this is shown approximately by the dashed circle surrounding the sources b.42 and b.99 in Figure 2 below. 




In the case of the standard catalogue, catb, which we assume has been set up by the database administrator, it is necessary to have an associated pixel table, pixb, and since the error radii are larger than before, the artificially extended error ellipse will be more likely to cross pixel-boundaries and thus will end up with more rows than before.  This time, however, there is no table pixa nor joinab.  The cross-match result can be obtained in a single three-way join, using a query like this:
SELECT ida, a.ra AS raa, a.decl AS decla, mag, idb, b.ra AS rab, b.decl AS declb, flux
FROM cata AS a, catb AS b, pixb AS p
WHERE a.pcode = p.pcode AND p.idb = b.idb
AND greatCircleDistance(a.ra, a.decl, b.ra, b.decl) < 3;

This has several advantages over the previous method
  1. It does not require any index to be created on the contents of the user's own catalogue, cata, only on those resident in the system.
  2. It requires no SELECT DISTINCT as there is no risk of duplicates.
  3. The additional column in the user's table, pcode, only contains the mapping of a point in (RA,dec) space to pixel-number, so this is just a simple scalar-valued function.  There should be no problem in providing this as a user-defined function in the DBMS, thus allow this extra column to be added and populated within the DBMS.
What are the problem areas?  The three-way join may be a little slower than that in the original pixel-code method, as the pixb tables will be have more rows, because of the extended error-radii.  The database administrator has to create the additonal tables pixb for each standard catalogue in the data collection, and the necessary index on each one.

2.5 Implementation Requirements

This section examines what an existing data centre holding some set of standard catalogues within some DBMS has to do to allow users to upload their own catalogues and cross-match them with standard ones.

  1. It is necessary to choose the "additional search radius" - that is the largest error radius likely to be encountered in the user's catalogues.
  2. Extract from each standard catalogue a table containing just two columns: the unique source identifier, and the error-radius (maximum of the error region is non-circular).
  3. A utility will then process this table, adding on the additional search radius to the value in each row, and generating a set of pixel-codes using the chosen mapping function (HTM, HEALPix, or whatever).  This outputs a new table, with columns of unique identifier and pixel-code, with one row per pixel-code per source.
  4. This table is then ingested into the DBMS as the pixel-code table.
  5. A B-tree index is created on the pixel-code column of this auxiliary table.
  6. A B-tree index is created on the unique identifier column of the original catalogue.
When the user uploads the new catalogue, it will usually be necessary to convert its format to CSV, as that is all the most DBMS will ingest.  Starlink's TOPCAT utility can be used to convert a number of formats, including FITS tables and VOTable.  But it is also necessary to add a new column to the table for the pixel-code of the central position, and of course the user will have to specify the columns holding the RA and DEC coordinates, and their units.  In many cases the units will also need to be converted.   The utility which does this can also use the HTM or HEALPix library code to produce the pixel-code value to be inserted into the additional column, and finally it can generate the SQL statements (CREATE TABLE etc.) for the creation and ingestion of the user's table into the DBMS. 

It is then easy to ask the user for the matching criterion, and to generate the SQL statement necessary to do the three-way join.
external application. 

A note on the "additonal search radius" to be chosen in step (1) above: this requires some astronomical judgement to estimate the largest search radius likely to be used in any user-catalogue that will be uploaded.  For many years the positions of X-ray source were the least well-known and likely to be most problematic, but it is worth pointing out that the XMM-Newton source positions have quoted possitonal errors of an arc-second or two, and in the latest release 99% of them have a 1-sigma radius of no more than 3 arc-seconds.  The positions from the Chandra observatory are likely to be even more precise.  To cover say a 3-sigma radius, and be fairly generous, one might consider that an additional search radius of 10 or 15 arc-seconds was reasonable.   This is small enough that a 32-bit pixel-code can be used as standard, and the number of extra pixel-codes per source will be reasonably small.

3. Conclusion

If we want to set up a simple facility for cross-matching catalogues, when these are either uploaded by the user, or created by taking a subset of some standard catalogue, the modified pixel-code method seems the best one to use.  

To extend this facility to allow the cross-matching of subsets selected from existing catalogues, the only extra facility is that of being able to submit suitable SELECT commands to the DBMS, and to store the results in situ in the DBMS.  The planned facilities of the query languate (ADQL) and of MySPACE/MyDB ought to be adequate.


Appendix: Syntax for Creating a Spatial Index

Creating a spatial index on a column spcol of a table called sptable:

DB2:
CREATE INDEX spind 
ON sptable (spcol)
EXTEND USING db2gse.spatial_index (1.0, 10.0, 100.0);

Informix:
CREATE INDEX spind ON sptable(spcol st_geometry_ops) USING 
RTREE(NO_SORT='FALSE', FILL_FACTOR='90', BOUNDING_BOX_INDEX='NO');

Ingres:
CREATE INDEX spind ON sptable(spcol)
WITH STRUCTURE=RTREE, RANGE=((-25000000,275000000),(325000000,625000000));

MySQL:

CREATE SPATIAL INDEX spind ON sptable(spcol);

Oracle-8i:

CREATE INDEX spind ON sptable(spcol) 
INDEXTYPE IS mdsys.spatial_index PARAMETERS('sdo_fanout=64 sdo_rtr_pctfree=10');
Postgres:

CREATE INDEX spind ON sptable USING RTREE(spcol);


The syntax for using a spatial index within a join shows equal diversity, which makes it very difficult to produce a portable spatial query.