Cross-matching Catalogues: Proposed
Implementation
Clive Page
2005 January 6
1. Introduction
The federation of datasets has been high on the list of priorities for
the Virtual Observatory projects but only limited progress has been
made so far. The scientific importance hardly needs to be
stressed: astronomers need to combine information from different
wavebands to get an understanding of the spectral properties of
objects, and similarly to combine information taken at different epochs
to see whether sources are variable. The Aladin service of CDS
provides good facilities for the federation of images, but facilities
for federating catalogues are still comparatively primitive.
The term federating means, I
guess, that we want to be able to use information from two datasets
without necessarily having them present on the same computer.
This is actually something of a tall order, not to say an obstacle to
progress. The mantra "ship the results not the data" is fair
enough when there all the data you need to use are resident on a single
remote server, but if you want to perform some process on two large
datasets, resident on different remote systems, this is of little
help. What you want to do is to transfer the minimum amount of
information from one site to another, since bandwidth is usually the
scarcest resource in large-scale data handling.
Ideally what is needed is a catalogue cross-matching facility which can
handle:
- Catalogues ranging in size from the very smallest to the largest
in current use (over a billion sources).
- Both personal tables in various formats which can be uploaded,
and standard catalogues resident on a remote server.
- Catalogues which are all resident in the same database, and those
distributed over the wide area network.
- Those with error-regions specified in various ways (circles,
ellipses, etc.).
I don't think we can satisfy all these requirements at once. In
particular any project which involves the cross-matching of one large
catalogue with another is bound to be a significant undertaking,
requiring astronomical and database expertise. It would be better
to concentrate initially on providing a system which can cross-match
relatively small catalogues, say up to a million sources (and which may
arise from the user's own researches, or be a subset extracted from
some standard catalogue) with standard catalogues which come in all
shapes and sizes.
1.1 Cross-matching criteria
The principal cross-matching criterion is always coincidence of
celestial position. Of course an exact match is unlikely because
there are many source of error in the apparent celestial coordinates of
sources: what is needed is a match within the overlap of
error-regions. Existing catalogues seem to come with positional
error-regions of four main types:
- All sources in the catalogue are assumed to be measured with the
same accuracy, so we can assume a circular error region of constant
radius (which one hopes to discover by reading the documentation
associated with that catalogue).
- Sources have individual positional errors typically depending on
their flux or magnitude, but these are equal along the coordinate axes,
so there is just a single column of error-radius.
- The errors along the RA and declination directions are different,
so there are two columns, of RA-error and DEC-error, specifying an
error-ellipse.
- The error region is elliptical but the axis is inclined to the
coordinate axes, so there are columns of major-axis-error,
minor-axis-error, and position angle.
More complex error regions are found in a very small number of cases,
but the concept of error-ellipse
seems general enough for the present purposes. The most serious
additional complication is that the units of the errors vary widely,
from degrees, radians, to arc-minutes and arc-seconds, or even
sub-multiples of them. This all has to be understood and handled
by the astronomer ingesting a new catalogue to the system. In due
course when standards for UCDs and units are fully implemented, it
should be possible to handle such variations automatically.
Ideally each source in the first catalogue has exactly one counterpart
in the second. In practice many sources have no apparent
counterpart, perhaps because they are too faint to have been
detected. This in itself can be valuable information, so that the
user should be given the option to record such non-matches: in database
terms this corresponds to a LEFT
OUTER JOIN. The other case, not
infrequently encountered, is that there are two or more counterparts
found on the basis of the overlap of error-regions: here it is often
possible to use other properties of the sources to work out which of
them is the most likely counterpart. Obviously considerable
astronomical expertise is needed at this stage.
The important point is that we want the database system to whittle down
the number of possible counterparts from the millions or billions to at
most a handful, after which the data volume is small enough that the
astronomer can easily apply the necessary scientific expertise to find
the single best match. The latter stage is beyond the scope of
the present document.
1.2 Filter and Refine Stages
The computation of the great-circle distance between two points on the
celestial sphere involves using several trig functions, which are
computationally expensive. The majority of cross-matching
algorithms therefore proceed in two stages
- The filter stage finds
a small number of possible matches on the basis of approximate
positional match, for example when the minimum bounding rectangles
(MBR) of the error-regions of the sources overlap.
- The refine stage carries
out the spherical trigonometry to measure actual offsets to weed out
those sources which do not actually have overlapping error
ellipses. If the astronomer wishes, this refinement stage may
also compare other properties of the sources such as magnitude or
spectral type, to reduce the number of cases in which more than one
source appear to match.
2. Possible Algorithms
2.1 Use a Spatial Index
If you need to process spatial data (such as spherical-polar
coordinates) then it seems to me to be obvious that you should use a
DBMS which supports spatial objects and provides a spatial index.
Many products provide such facilities and have them fully integrated
into the product, for example DB2, Informix, MySQL, and Postgres.
All of these except DB2 have spatial indexing based on R-trees, which
is simple and requires no configuration. DB2 uses a multi-level
grid file, which requires conversion of the coordinates to integers
with a specified range. My evaluation of spatial indexing in
MySQL and Postgres was reported in http://wiki.astrogrid.org/pub/Astrogrid/DataDocs/cross2.htm
and showed that it worked well in practice, was reasonably easy to set
up, and efficient. A few other DBMS have spatial indexing as an
optional extra, often using code from a third-party.
Unfortunately there are also a few DBMS where there seems to be no
spatial support options, and this includes Microsoft SQL Server.
To provide a cross-matching facility at a single site it is merely
necessary to use one of the DBMS with support for spatial indexing, and
use the facilities provided to do a spatial join. It would be
very easy to set up a user facility of this type at Leicester, where a
number of tables have been ingested into Postgres. Providing a
portable solution would be rather harder, especially as the syntax used
to create and use spatial indices varies so much - see Appendix A for
some of the gory details.
Unfortunately several astronomical data archive sites use DBMS which do
not have spatial indexing, so the options would seem to be
- Persuade these sites to switch from their current DBMS to one
which supports spatial indexing, or
- Find a another solution to the problem. More on this option
below.
2.2 Sort/sweep Algorithm
An alternative spatial join algorithm was invented (or discovered) by
Dave Abel, Drew Devereux, Robert Power, and Peter Lamb of CSIOR as
reported in An O(NlogM) Algorithm for Catalogue Matching.
Details of the performance are given in another technical report, Benchmarking Catalogue Cross Matching. It is
not easy to make exact comparisons with the R-tree indexing benchmarks
which I obtained, although many of the same datasets were used, because
the platforms were very different. There can be no doubt,
however, that the sort/sweep algorithm is about an order of magnitude
faster when one large table is cross-matched with another of comparable
size, but when a small catalogue is cross-matched with a large one, the
R-tree method is up to an order of magnitude faster.
This result is not too surprising: the sort/sweep algorithm reads the
whole of both tables sequentially, whereas a spatial join using a DBMS
seems to use a nested loop join: the smaller table is read
sequentially, and for each of its rows, the matching rows in the larger
table are found using the R-tree index. An indexed lookup takes
two or three disc seeks each taking around 10 ms, so if there are only
a few thousands of these, this does not take much time. But if
there are hundreds of millions of them, this is inefficient.
Reading a disc sequentially is much faster, since sequential read rates
of around 50 MB/sec are easy to achieve, i.e. one disc block per 0.01
ms. The automatic disc caching which modern operating systems
provide reduce this ratio somewhat, but a random disc read can be as
much as 1000 times as expensive as a sequential one. Providing
one is only accessing less than 0.1% of the second catalogue,
therefore, random access is probably faster, but if there are so many
rows in the first catalogue that there are substantially more look-ups
than that then the sequential read method would be expected to be
faster, which is what we observe in practice.
Another snag with the sort/sweep algorithm is that it cannot be
performed within a relational DBMS, only with specially written code
(written in C++ by the CSIRO team). It is therefore necessary
first to export all the data from the DBMS, run the cross-match code,
and then it will normally be necessary to ingest the results into the
DBMS. For most products, data export and ingest have to use plain
text files in character-separate value format, which involves a lot of
I/O overheads.
2.3 Pixel-code Algorithm
Many people have had the idea that the problem could be solved easily
if the celestial sphere were to be covered with a grid of pixels, so
that each point (RA,dec) maps to an
integer pixel number. This does indeed make it fairly easy to
implement a simple cone-search, as this turns into a search for a list
of integer pixel numbers. Such a search can be made quite
efficient by creating a B-tree index on the column of integer
pixel-codes, something that is feasible with every relational
DBMS. A number of different grids have
been proposed: the simplest is the a grid of pseudo-rectangles drawn
along the lines of constant RA and declination, usually known as the
igloo or disco-ball grid. This simple grid may be good enough,
but depending on the exact design, it may suffer from the convergence
of many grid lines at the poles. A couple of more complex
pixellations which overcome these problems are HTM, which uses
triangular pixels or trixels, and HEALPix which has square ones of
equal area. Further details of these are given in http://wiki.astrogrid.org/bin/view/Astrogrid/SkyIndexing
and references therein. The diagrams below will, for simplicity,
assume rectangular pixels, but the same arguments apply to other shapes.

The problem is, of course, that we are not dealing with points, but
with small error-regions, which are typically ellipses or
circles.
However the grid is drawn, sometimes an error ellipse will cross a
pixel boundary. It is sensible to choose a grid granularity which
makes the pixels all larger than the major-axis of the largest
error ellipse, so that the worst case is that an ellipse can cross four
pixels (HEALPix) or six trixels (HTM). The worst case for the
igloo mapping may be higher if too many grid lines converge on the
poles. It might be thought that we could ignore
these awkward cases, but as noted in http://wiki.astrogrid.org/bin/view/Astrogrid/PixelCodeSizes
using a grid resolution chosen to maximise efficiency gives you pixels
of about 25 arc-seconds
across, and even if the largest error ellipses are no more than one
arc-second across, between 12.5 and 15.5% of error ellipses will cross
a pixel boundary, so if just the central position is mapped to a
pixel-code, there can be a substantial loss of potentially valid
cross-matches. Naturally the loss gets much higher if the typical
error ellipses are a larger fraction of the pixel size.
The naive solution is to add to the catalogues a
column which can contain a list of pixel-codes. For example, the
database tables corresponding to the diagram above might look like this
for cata:
ida
|
ra
|
decl
|
mag
|
pixelcodes
|
1
|
123.45
|
23.45
|
14.1
|
123, 124
|
33
|
123.47
|
23.50
|
15.67
|
124, 127
|
And catb:
idb
|
ra
|
decl
|
flux
|
pixelcodes
|
42
|
123.49
|
23.49
|
2.4e-3
|
124, 127
|
99
|
123.46
|
23.47
|
1.2e-3
|
124
|
Naively we might expect to be able to do a database join on these
tables. Unfortunately the rules of relational databases do not
allow multiple values in a field, as we have for the pixelcodes column. Fortunately
there are well-established ways of handling problems like
these where there is a many-to-many relationship: the methods of entity
relationship modeling can be used to
derive table structures which are all properly normalised according to
the rules of Codd and Date.
After applying these rules, we get the following. The tables cata and catb have their pixelcodes column
removed, but two additional tables have to be created to map the the
unique source identification with the pixel-code: when a source
error-ellipse covers two (or more) pixels then there are simply two (or
more) rows in these additional tables, which I
shall call pixa and pixb.
Table pixa
ida
|
pcode
|
1
|
123
|
1
|
124
|
33
|
124
|
33
|
127
|
Table pixb
idb
|
pcode
|
42
|
124
|
42
|
127
|
99
|
124
|
These two auxiliary tables will have somewhat more rows than the
corresponding catalogues, but only two columns, so they do not require
a huge amount of extra disc space. The methods of entity
relationship modeling tell us how to get the cross-matching result that
we need.
The next step is
to join pixa and pixb on their common column of pixelcode, which can be
done efficiently if a B-tree index is created on the pixelcode column
of at least one of them. The command to do this in the Postgres
dialect of SQL looks like this (other dialects have slightly different
syntax):
SELECT pcode, ida, idb INTO TABLE joinab FROM pixa AS a, pixb AS b WHERE a.pcode = b.pcode;
The result of the join is an new table, which we called joinab, which looks like this:
pcode
|
ida
|
idb
|
124
|
1
|
42
|
124
|
1
|
99
|
124
|
33
|
42
|
127
|
33
|
42
|
127
|
33
|
99
|
Actually the pcode column is not really needed, but I include it here
just to show where the each row comes from. Clearly the resulting
table includes some duplication, because both sources a.33 and b.42 are
present in
pixels 124 and 127. The way to remove duplication is to use the
SQL keyword DISTINCT in
the selection, like this:
SELECT DISTINCT pcode, ida, idb INTO TABLE joinab FROM pixa AS a, pixb AS b WHERE a.pcode = b.pcode;
Then the resulting table has no rows duplicated exactly:
pcode
|
ida
|
idb
|
124
|
1
|
42
|
124
|
1
|
99
|
127
|
33
|
42
|
127
|
33
|
99
|
The final stage is to use this joinab table in a three-way join with
the
original catalogues,
SELECT ida, a.ra AS raa, a.decl AS decla, mag, idb, b.ra AS rab, b.decl AS declb, flux
FROM cata AS a, joinab AS j, catb AS b
WHERE a.ida = j.ida AND j.idb = b.idb
AND greatCircleDistance(a.ra, a.decl, b.ra, b.decl) < 3;
This assumes that we have available to us a function to compute the
great-circle distance between two points on the celestial sphere, and
for simplicity that we want to accept matches only if the distance
between the points is less than some fixed value, here 3.
In many cases, of course, the distance threshold will be some function
of the individual error-ellipse parameters, but this can be handled
with just a bit more complicated trigonometry. Most DBMS allow
user-defined functions like this to be written in a procedural
language, such as C, Java, or pseudo-SQL. This produces the
result we wanted, thus:
ida
|
raa
|
decla
|
mag
|
idb
|
rab
|
declb
|
flux
|
1
|
123.45
|
23.45
|
14.1
|
99
|
123.46
|
23.47
|
1.2e-3
|
33
|
123.47
|
23.5
|
15.67
|
42
|
123.49
|
23.49
|
2.4e-3
|
Discussion
This algorithm may seem a bit complicated, but I have shown that
it works in practice. But it does have some problems in practice,
especially for the case in which the smaller table, cata, is imported
by the user, or created by user e.g. by taking a subset of some
standard catalogue. The main problems are these:
- After the user's catalogue is ingested into the database, it is
necessary to create the pixa
table, which may have several rows for each row in the original cata table. Generating this
within a DBMS is not easy: a user-defined function is normally only
allowed to return a single scalar value, whereas what we need here is
an array-valued function. I think the only practical solution is
to have code external to the DBMS which creates the pixel table, and
then ingests it into the DBMS.
- The next stage depends on having a B-tree index created on both
cata and pixa tables. The CREATE
INDEX operation is not one that is supported by JDBC, so this
also needs special code support, which will be hard to make portable.
- The SELECT DISTINCT
operation required when generating the join table implies sorting the
results: for a large table this can be slow, and it always requires at
least N log(N) operations for N rows, so its scaling is worse than
linear.
The first two factors do not really apply in the case of the
Astronomical Data Warehouse, where all tables can be set up with all
the necessary auxiliary pixel tables and indices by the database
administrator, but in this environment the cross-matching can be done
more easily and efficiently using a true spatial index.
2.4 Modified Pixel-code Method
I have recently devised a modification of the preceding method which I
think overcomes most of its difficulties. This is based on the
observations that, firstly, the cross-matching is carried out in two
stages, and that the selectivity of the initial filter stage is not
crucial to its success, and secondly, that the refinement depends upon
selecting sources within the combined error-radii, but it does not
matter much whether the necessary distance is associated with one
source or the other. The steps necessary are these:
After the new table, taba, is
imported or created by the user, it is necessary to append a new column
of pixel-code values to it, but for each row this just contains the
single pixel-code (HTM, HEALPix or whatever) corresponding to the
central source position. The HTM algorithm is recursive, but the
HEALPix algorithm is simple enough that I managed to encode it in a
user-defined function in the Postgres procedural language,
pl/pgsql. It is likely that a similar user-defined function could
be created for all other DBMS, which would allow the pcode column to be
created and populated using just SQL statements, which could be carried
out via the JDBC interface. The resulting table would look like
this:
ida
|
ra
|
decl
|
mag
|
pcode
|
1
|
123.45
|
23.45
|
14.1
|
123
|
33
|
123.47
|
23.50
|
15.67
|
124
|
It is also necessary to estimate the largest error-radius (or for
ellipses the largest semi-major axis value) of any source likely to be
cross-matched with the standard catalogues, Rmax .
This is likely to be just a few arc-seconds for practical cases.
This radius then needs to be added to the radius (or radii) of the
error-region for each source in the catalogue catb. The effect of this is
shown approximately by the dashed circle surrounding the sources b.42
and b.99 in Figure 2 below.

In the case of the standard catalogue, catb, which we assume has been
set up by the database administrator, it is necessary to have an
associated pixel table, pixb,
and since the error radii are larger than before, the artificially
extended error ellipse will be more likely to cross pixel-boundaries
and thus will end up with more rows than before. This time,
however,
there is no table pixa nor joinab. The cross-match result can be
obtained in a
single three-way join, using a query like this:
SELECT ida, a.ra AS raa, a.decl AS decla, mag, idb, b.ra AS rab, b.decl AS declb, flux
FROM cata AS a, catb AS b, pixb AS p
WHERE a.pcode = p.pcode AND p.idb = b.idb
AND greatCircleDistance(a.ra, a.decl, b.ra, b.decl) < 3;
This has several advantages over the previous method
- It does not require any index to be created on the contents of
the user's own catalogue, cata, only on those resident in the system.
- It requires no SELECT
DISTINCT as there is no risk of duplicates.
- The additional column in the user's table, pcode, only contains
the mapping of a point in (RA,dec) space to pixel-number, so this is
just a simple scalar-valued function. There should be no problem
in providing this as a user-defined function in the DBMS, thus allow
this extra column to be added and populated within the DBMS.
What are the problem areas? The three-way join may be a little
slower than that in the original pixel-code method, as the pixb tables
will be have more rows, because of the extended error-radii. The
database administrator has to create the additonal tables pixb for each
standard catalogue in the data collection, and the necessary index on
each one.
2.5 Implementation Requirements
This section examines what an existing data centre holding some set of
standard catalogues within some DBMS has to do to allow users to upload
their own catalogues and cross-match them with standard ones.
- It is necessary to choose the "additional search radius" - that
is the largest error radius likely to be encountered in the user's
catalogues.
- Extract from each standard catalogue a table containing just two
columns: the unique source identifier, and the error-radius (maximum of
the error region is non-circular).
- A utility will then process this table, adding on the additional
search radius to the value in each row, and generating a set of
pixel-codes using the chosen mapping function (HTM, HEALPix, or
whatever). This outputs a new table, with columns of unique
identifier and pixel-code, with one row per pixel-code per source.
- This table is then ingested into the DBMS as the pixel-code table.
- A B-tree index is created on the pixel-code column of this
auxiliary table.
- A B-tree index is created on the unique identifier column of the
original catalogue.
When the user uploads the new catalogue, it will usually be necessary
to convert its format to CSV, as that is all the most DBMS will
ingest. Starlink's TOPCAT utility can be used to convert a number
of formats, including FITS tables and VOTable. But it is also
necessary to add a new column to the table for the pixel-code of the
central position, and of course the user will have to specify the
columns holding the RA and DEC coordinates, and their units. In
many cases the units will also need to be converted. The
utility which does this can also use the HTM or HEALPix library code to
produce the pixel-code value to be inserted into the additional column,
and finally it can generate the SQL statements (CREATE TABLE etc.) for
the creation and ingestion of the user's table into the DBMS.
It is then easy to ask the user for the matching criterion, and to
generate the SQL statement necessary to do the three-way join.
external application.
A note on the "additonal search radius" to be chosen in step (1) above:
this requires some astronomical judgement to estimate the largest
search radius likely to be used in any user-catalogue that will be
uploaded. For many years the positions of X-ray source were the
least well-known and likely to be most problematic, but it is worth
pointing out that the XMM-Newton source positions have quoted
possitonal errors of an arc-second or two, and in the latest release
99% of them have a 1-sigma radius of no more than 3 arc-seconds.
The positions from the Chandra observatory are likely to be even more
precise. To cover say a 3-sigma radius, and be fairly generous,
one might consider that an additional search radius of 10 or 15
arc-seconds was reasonable. This is small enough that a
32-bit pixel-code can be used as standard, and the number of extra
pixel-codes per source will be reasonably small.
3. Conclusion
If we want to set up a simple facility for cross-matching catalogues,
when these are either uploaded by the user, or created by taking a
subset of some standard catalogue, the modified pixel-code method
seems the best one to use.
To extend this facility to allow the cross-matching of subsets selected
from existing catalogues, the only extra facility is that of being able
to submit suitable SELECT commands to the DBMS, and to store the
results in situ in the DBMS. The planned facilities of the query
languate (ADQL) and of MySPACE/MyDB ought to be adequate.
Appendix: Syntax for Creating a Spatial Index
Creating a spatial index on a column spcol of a table called sptable:
DB2:
CREATE INDEX spind
ON sptable (spcol)
EXTEND USING db2gse.spatial_index (1.0, 10.0, 100.0);
Informix:
CREATE INDEX spind ON sptable(spcol st_geometry_ops) USING
RTREE(NO_SORT='FALSE', FILL_FACTOR='90', BOUNDING_BOX_INDEX='NO');
Ingres:
CREATE INDEX spind ON sptable(spcol)
WITH STRUCTURE=RTREE, RANGE=((-25000000,275000000),(325000000,625000000));
MySQL:
CREATE SPATIAL INDEX spind ON sptable(spcol);
Oracle-8i:
CREATE INDEX spind ON sptable(spcol)
INDEXTYPE IS mdsys.spatial_index PARAMETERS('sdo_fanout=64 sdo_rtr_pctfree=10');
Postgres:
CREATE INDEX spind ON sptable USING RTREE(spcol);
The syntax for using a spatial index within a join shows equal
diversity, which makes it very difficult to produce a portable spatial
query.