Association Methods for WP-A5
Task:
A5.Y1.Q3-11 from Section 5.2.3 of the
Project Description.
Introduction
The main goal of
WP-A5.2 is to assess association methods in the VO context, by seeking optical counterparts for
XMM-Newton sources in catalogues of
INT-WFC and
SuperCOSMOS Sky Survey data. This topic lies
at the very heart of the VO concept, since without reliable and efficient association techniques, the effort spent to make a large variety of astronomical databases interoperable via the VO infrastructure will have been in vain. As noted in the
WPA5 Final Report, the work on
WP-A5.2 was suspended part way through Phase A, due to delays with the delivery of the
XMM-Newton source catalogue, so this note does not report the results of hands-on experiments conducted as part of
WP-A5.2, but rather outlines the problem it was designed to address and discusses some possible solutions to it. This remains a significant problem for
AstroGrid, and for the VO in general, and, now that the
XMM-Newton source catalogue is available, it should be addressed (in part, at least) either by a
resurrected
WP-A5.2, running in Phase A's Q5, or as part of the work of the
XMM-Newton Survey Science Centre.
The Problem
The association problem boils down to the question of how best to find the object in one catalogue most likely to be the true counterpart of a particular source in another catalogue. For the VO, this basic problem is extended somewhat. The VO aspires to give the user access to many large databases stored at different locations. This means that a good VO association technique should have performance that scales well to large sample sizes, and should not require the transfer of vast quantities of data over the network between geographically-separated databases. Furthermore, if the VO's advocates are correct in assuming that the data mining of multiwavelength datasets will rapidly increase in importance once the VO makes their integration possible, then associations made between a pair of popular datasets by one researcher are likely to be of interest to many subsequent researchers. This implies not only that the associations themselves should be stored in some form, but also that the method used to make them should be recorded, so that later users can assess whether the procedure used is adequate for their particular purpose.
So, the VO association problem may be broken up into three constituent problems:
- how best to make associations between astronomical objects
- how to implement efficiently the association of objects in distributed databases
- how to record the associations made and the methods used to make them so that subsequent users can make use of that information
Of course, these three questions are not independent: the best association algorithm, astrophysically-speaking, may be very expensive computationally when run on large datasets, and also difficult to describe in simple metadata. Despite that, we shall take these three questions in turn here.
How best to make associations between astronomical objects
Association by proximity
Astronomical objects have a natural index - position in the sky - so one might think that the key to making associations between astronomical objects was the use of spatial indexing techniques to match sources in different catalogues by proximity. As Clive Page's document
Indexing the Sky makes clear, this seemingly simple operation is not so easy, for a number of reasons.
Firstly, the celestial sphere is a two-dimensional surface and a curved one at that, both of which make sky indexing non-trivial given the 1D (e.g. B-tree) indexing schemes typically used by database management systems. Solutions to this problem exist, though. For example, there are two schemes well known in astronomy for pixelising the sphere and labelling the pixels in a manner well suited to 1D indexation:
HEALPix (for
Hierarchical
Equal
Area Iso
latitude
Pixelisation), developed by Kris Gorski and collaborators, and the
HTM (for
Hierarchical
Triangular
Mesh) scheme, from the SDSS database group at Johns Hopkins. The simple geometric origin of both these pixelisations means that it is relatively simple to compute the set of pixels (at a given resolution level) that intersect a given region of the celestial sphere, and their hierarchical nature means that it is then possible to look for entries in a database table lying within those pixels at the appropriate resolution scale.
Secondly, astronomical positions have uncertainties, so it is necessary to make a
fuzzy join, matching pairs of sources within a certain distance of each other, rather than making an exact positional match, which would be an easier operation for a relational database. As Page describes, B-tree indexes using pixelisation schemes such as
HTM or
HEALPix can produce fairly
efficient matching of near neighbours, even for large databases, using
his PCODE algorithm.
Finally, the results of such a pixel-based neighbour-finding have to be filtered, by great circle distance, to complete the fuzzy join, and the trigonometry of the great circle distance formulae is somewhat complicated to write in SQL - certainly something to be implemented as a user-defined function, rather than left to the individual user to type in. Furthermore, vendor-specific deviations from the SQL92 standard mean that the implementation of such formulae will vary between different database management systems. (
N.B. This discussion assumes that the object and source catalogues have consistent astrometry, which will not always be the case in the VO context, at least initially; part of the procedure of association by proximity might involve using one dataset to improve the astrometry of the other - for example, using high resolution radio data tied to the ICRF to tweak the astrometric reference frame of an optical catalogue.)
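The filtering step described above can be sketched as a user-defined function registered with a DBMS. The following is a minimal illustration only, using Python's built-in sqlite3 module rather than any of the systems discussed in this note. The table names (sources, objects), the column names (id, ra_deg, dec_deg) and the function name angsep are all hypothetical, and a simple declination-band condition stands in for the pixel-based preselection that a real implementation would use.

```python
import math
import sqlite3


def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation in arcsec between two positions in degrees.

    The haversine form is used because it remains accurate at the small
    separations typical of cross-matching.
    """
    ra1, dec1, ra2, dec2 = map(
        math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg)
    )
    sin_half_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_half_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_half_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_half_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


# Hypothetical schema: sources(id, ra_deg, dec_deg), objects(id, ra_deg, dec_deg).
# The BETWEEN clause is a crude preselection (a stand-in for a pixel index);
# the WHERE clause is the exact great-circle filter that completes the join.
FUZZY_JOIN_SQL = """
SELECT s.id AS source_id,
       o.id AS object_id,
       angsep(s.ra_deg, s.dec_deg, o.ra_deg, o.dec_deg) AS sep_arcsec
FROM sources AS s
JOIN objects AS o
  ON o.dec_deg BETWEEN s.dec_deg - :radius_deg AND s.dec_deg + :radius_deg
WHERE angsep(s.ra_deg, s.dec_deg, o.ra_deg, o.dec_deg) <= :radius_arcsec
ORDER BY s.id, sep_arcsec
"""


def fuzzy_join(conn, radius_arcsec):
    """Return (source_id, object_id, sep_arcsec) for all pairs within the radius."""
    # Register the Python function so that SQL can call it as angsep(...).
    conn.create_function("angsep", 4, angular_separation_arcsec)
    params = {
        "radius_deg": radius_arcsec / 3600.0,
        "radius_arcsec": radius_arcsec,
    }
    return conn.execute(FUZZY_JOIN_SQL, params).fetchall()
```

Registering the function once means that users write only angsep(...) in their queries, rather than typing the trigonometry themselves; the registration call itself is specific to SQLite and would differ in other database management systems.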
Why association by proximity is often not enough
In many of the most interesting astronomical applications,
association by proximity is not adequate, due, in some sense, to the "fuzziness" of the fuzzy join involved. The recent preprint by
Dunlop et al., which
discusses the quest to identify the source HDF850.1,
neatly illustrates this point. When
Hughes et al. (1998) discovered HDF850.1, using the
SCUBA instrument on the James Clerk Maxwell Telescope (
JCMT), they
sought an optical counterpart in the
Hubble Deep Field data of
Williams et al. (1996), the deepest
optical data in existence: HDF850.1, and the other sources detected by
SCUBA in the
HDF, were expected to be very faint optically, so the deepest possible optical data were required for the association procedure. The angular
resolution of the
SCUBA data is relatively poor - the FWHM of the beam is about 15 arcsec - which means that, with optical data as deep as those in the
HDF, there will be, on average, several optical objects lying within the positional error ellipse of each
SCUBA source. In this case, positional proximity is not enough to determine which of these is most likely to be the true optical counterpart. This
SCUBA-
HDF example may be a little extreme, but there are many situations in which the positional information is not good enough to yield an unambiguous identification by itself, and astrophysical information about the objects involved has to be folded into the association procedure.
Association methods using astrophysical information
There exist a number of methods for incorporating astrophysical information into association procedures. The well known techniques are all probabilistic, and all boil down to looking at all possible
candidates (e.g. all objects within some maximum search radius of
the source, whose value is set by the estimated uncertainties in the source and object positions) and assigning to each a probability that its configuration (its distance from the source position, given its
astrophysical properties) could occur by chance (i.e. with that object not being the true counterpart of the source). The favoured association is then the object with the lowest probability - i.e. that with a configuration least likely to have arisen by chance.
Perhaps the simplest of these prescriptions is the Poisson probability model (e.g.
Downes et al. (1986)), which, applied to our
SCUBA-
HDF example, would
run as follows. For each
SCUBA source, find all optical galaxies out to some maximum search radius and note, for each of them, its distance from the source,
d, and its optical magnitude,
m. Then, for each of
them, estimate the probability,
P0 = 1 - exp(-pi * d^2 * N(<m)), where
N(<m) is the surface density (number per unit area) of objects in the
HDF catalogue at least as bright as the particular object. The probability,
P, used to select the most likely optical counterpart is then computed from
P0 using a factor (depending on the limiting magnitude of the optical catalogue and the maximum search radius used) which corrects for the fact that other combinations of
d and
m could have produced the same value of
P0.
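As a rough illustration of the first stage of this prescription, the sketch below computes P0 for a set of candidates and ranks them by it. It is not the full method: the correction factor that converts P0 into P is omitted, since its form is not given here. The function names, the tuple layout of the candidates and the choice of units (arcsec and arcsec^2) are assumptions made for the example.

```python
import bisect
import math


def make_cumulative_density(catalogue_mags, area_arcsec2):
    """Return a function N(<m): surface density (per arcsec^2) of catalogue
    objects at least as bright as magnitude m."""
    sorted_mags = sorted(catalogue_mags)

    def density(m):
        # Smaller magnitude means brighter, so count entries with mag <= m.
        return bisect.bisect_right(sorted_mags, m) / area_arcsec2

    return density


def poisson_p0(d_arcsec, density_per_arcsec2):
    """P0 = 1 - exp(-pi * d^2 * N(<m)).

    expm1 is used so that very small probabilities keep their precision.
    """
    return -math.expm1(-math.pi * d_arcsec ** 2 * density_per_arcsec2)


def rank_by_p0(candidates, density):
    """Rank candidates by P0, lowest (least likely to be chance) first.

    candidates: iterable of (object_id, d_arcsec, magnitude).
    Returns a list of (object_id, P0).
    """
    scored = [
        (obj_id, poisson_p0(d, density(m))) for obj_id, d, m in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1])
```

In use, the density function would be built once from the whole optical catalogue and its area, and rank_by_p0 would then be called for the candidates around each source in turn.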
The Poisson probability method has the benefit of simplicity, but more sophisticated prescriptions can also include information about the source population. Most simply,
if the source can be classified according to some property, then only objects of that class need be included in the association procedure: so, in our
SCUBA-
HDF example, we could have excluded the few stars in the optical
HDF catalogue, on the assumption that our
SCUBA source could not be a star. More sophisticated algorithms can take account of the known properties of the source population. For example, the Likelihood Ratio method (e.g.
Sutherland and Saunders (1992)), selects as the favoured association, from a set of candidate counterparts, that with the largest value of a likelihood ratio, which is defined to be
the ratio of the infinitesimal probability of finding the true counterpart to the source at the position of the object and with its flux to the infinitesimal probability of an object with that flux being found there by chance (and, clearly, this approach can work equally well for more or different quantities than flux - for example, an extragalactic survey might include photometric redshift estimates of optical galaxies, and a Galactic survey might include velocity information of particular objects). In the
Sutherland and Saunders (1992) formulation, a model for the flux distribution of the true counterparts is employed, while, in the more recent variant presented by
Rutledge et al. (2000), the analogous information is estimated empirically, by studying "on-source" and "background" fields, in a manner similar to that employed by
Mann et al. (1997) and
Mann et al. (2002).
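The likelihood ratio described above can be sketched as follows, writing it as LR = q(flux) * f(r) / n(flux). Here q is the assumed flux distribution of true counterparts, n is the surface density of chance objects per unit flux, and f is the positional error distribution. A circular Gaussian is assumed for f purely for illustration, and q and n are passed in as caller-supplied functions; all of the names are hypothetical, and neither of the published formulations is reproduced in detail.

```python
import math


def gaussian_positional_density(r_arcsec, sigma_arcsec):
    """f(r): probability per arcsec^2 of a true counterpart lying at offset r,
    assuming circular Gaussian positional errors of width sigma."""
    norm = 2.0 * math.pi * sigma_arcsec ** 2
    return math.exp(-0.5 * (r_arcsec / sigma_arcsec) ** 2) / norm


def likelihood_ratio(r_arcsec, flux, sigma_arcsec, q, n):
    """LR = q(flux) * f(r) / n(flux).

    q(flux): probability density of the true counterpart having this flux.
    n(flux): surface density of chance objects per unit flux (per arcsec^2).
    Both are supplied by the caller, from a model or an empirical estimate.
    """
    return q(flux) * gaussian_positional_density(r_arcsec, sigma_arcsec) / n(flux)


def best_counterpart(candidates, sigma_arcsec, q, n):
    """Return (object_id, LR) for the candidate with the largest LR.

    candidates: iterable of (object_id, r_arcsec, flux).
    Returns None if there are no candidates.
    """
    scored = [
        (obj_id, likelihood_ratio(r, flux, sigma_arcsec, q, n))
        for obj_id, r, flux in candidates
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None
```

Because q and n are arguments, the same code serves whether they come from a model of the counterpart population or from counts in "on-source" and "background" fields; a rare, bright object somewhat further from the source can then outrank a common, faint one that lies closer.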
How to implement efficiently the association of objects in distributed databases
The integration of data from distributed databases remains an active research topic in computer science, so it is not surprising that no general solution has been implemented within astronomy yet. The complication of the inclusion of astrophysical information means that those partial solutions which have been prototyped have been restricted to associations by proximity alone. The most noteworthy of these is the
SkyQuery system developed by the Johns Hopkins group. This prototype uses web services to enable the spatial matching of objects in SQL Server implementations of sky survey catalogues at three different locations: Sloan EDR at Fermilab, 2MASS at Caltech and a copy of FIRST loaded into SQL Server at JHU. (Note that the actual cross-match algorithm is implemented as a stored procedure that can run on all three SQL Server instances; this introduces a restriction that
SkyQuery, at least in its current form, is not readily applicable to catalogues stored in other DBMSs.)
SkyQuery seeks to avoid problems with moving vast amounts of data over the network by working out a query execution plan that moves data between the three databases in the order that minimises such traffic. The correct execution plan is worked out by querying each database in turn to find out how many entries each has that might satisfy the user's query. This also has the effect of bringing those objects into cache, so that the execution of the query does not require them to be read from disk for a second time.
It is not clear whether this neat recursive procedure,
handing data up from database to database, would be possible with a
more sophisticated association prescription - e.g. the likelihood ratio technique - as its use may be dependent on the simplicity of the purely positional case, which is symmetric between the "object"
and "source" catalogues. The more general case may be more complicated, but it is not clear that even it would require the shipping of large quantities of data across the network. It is more
likely that such a procedure would combine an initial fuzzy join, which only requires the transfer of the positions of the sources to the object database, with the running of some stored procedure which
implements the estimation of a likelihood ratio (or equivalent) for
the much smaller number of objects falling within the initial search
radius selected by the fuzzy join. This procedure may be greatly helped by pre-computing sets of "cross-neighbours" between pairs of popular databases - for example, producing a table that records all
SDSS objects that are within 15 arcsec of
each source in the
SuperCOSMOS Sky Survey database. Such a table could form the basis of a wide range
of possible association procedures, all of which start with an initial proximity search but which may do anything thereafter.
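The idea of a pre-computed cross-neighbours table can be sketched as below. This is only an illustration under simple assumptions: catalogues are held as in-memory dictionaries mapping an identifier to (RA, Dec) in degrees, a brute-force double loop stands in for the indexed fuzzy join that a real database would use, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def separation_arcsec(pos1, pos2):
    """Great-circle separation in arcsec between two (ra_deg, dec_deg) pairs."""
    ra1, dec1 = map(math.radians, pos1)
    ra2, dec2 = map(math.radians, pos2)
    h = (
        math.sin((dec2 - dec1) / 2.0) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2
    )
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def build_cross_neighbours(sources, objects, max_radius_arcsec):
    """Pre-compute, once, every object within max_radius of each source.

    sources, objects: dict mapping id -> (ra_deg, dec_deg).
    Returns dict mapping source_id -> list of (object_id, sep_arcsec),
    sorted by increasing separation.
    """
    table = defaultdict(list)
    for source_id, source_pos in sources.items():
        for object_id, object_pos in objects.items():
            sep = separation_arcsec(source_pos, object_pos)
            if sep <= max_radius_arcsec:
                table[source_id].append((object_id, sep))
        table[source_id].sort(key=lambda pair: pair[1])
    return dict(table)


def associate(cross_neighbours, score):
    """Choose, for each source, the neighbour maximising score(object_id, sep).

    Any association rule can be plugged in as the score function; the
    proximity search itself is never repeated.
    """
    return {
        source_id: max(neighbours, key=lambda pair: score(*pair))[0]
        for source_id, neighbours in cross_neighbours.items()
        if neighbours
    }
```

A nearest-neighbour rule is then associate(table, lambda object_id, sep: -sep), while a likelihood ratio or any other prescription can be supplied as a different score function over the same table.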
How to record the associations made and the methods used to make them so that subsequent users can make use of that information
This third question is the one for which a solution is least apparent now. It is not usual for the results of association procedures to be stored in a publicly-accessible manner (other than as part of the results written up into scientific papers), nor is there a clear way to describe, say, the details of a likelihood ratio procedure using
metadata that can be readily interrogated. This latter issue would
seem to be part of a much wider topic of how to express the provenance of derived data, and so may be addressed within other areas of e-science.
Application to WP-A5.2
The first part of the
WP-A5.2 association problem has already been addressed, under the auspices of
WP-A4, where an X-ray/optical cross-match was performed using a ROSAT catalogue. This worked as expected, but the density of sources on the sky was insufficient to challenge the algorithm thoroughly. If
WP-A5.2 is resurrected, now that the first
XMM-Newton source catalogue is available, it would be interesting to address some of these issues. For example, it would be illuminating to see how easy it is to write something like a likelihood ratio analysis as a stored procedure in a DBMS, and also to see how the creation of a cross-neighbours table can speed up the fuzzy join that would precede that: the latter is, in fact, also likely to be investigated by the resurrected optical/near-IR pilot of
WP-A5.1, to facilitate the association of
SDSS and
the
SuperCOSMOS Sky Survey objects within the
SDSS Early Data Release region.
--
BobMann - 14 Sep 2002