r4 - 25 Sep 2002 - 12:19:54 - BobMann

Association Methods for WP-A5

Task: A5.Y1.Q3-11 from Section 5.2.3 of the Project Description.

Introduction

The main goal of WP-A5.2 is to assess association methods in the VO context, by seeking optical counterparts for XMM-Newton sources in catalogues of INT-WFC and SuperCOSMOS Sky Survey data. This topic lies at the very heart of the VO concept: without reliable and efficient association techniques, the effort spent making a large variety of astronomical databases interoperable via the VO infrastructure will have been in vain. As noted in the WPA5 Final Report, the work on WP-A5.2 was suspended part way through Phase A, owing to delays in the delivery of the XMM-Newton source catalogue. This note therefore does not report the results of hands-on experiments conducted as part of WP-A5.2, but rather outlines the problem it was designed to address and discusses some possible solutions to it. This remains a significant problem for AstroGrid, and for the VO in general, and, now that the XMM-Newton source catalogue is available, it should be addressed (in part, at least) either by a resurrected WP-A5.2, running in Phase A's Q5, or as part of the work of the XMM-Newton Survey Science Centre.

The Problem

The association problem boils down to the question of how best to find the object in one catalogue most likely to be the true counterpart of a particular source in another catalogue. For the VO, this basic problem is extended somewhat. The VO aspires to give the user access to many large databases stored at different locations. This means that a good VO association technique should have performance that scales well to large sample sizes, and should not require the transfer of vast quantities of data over the network between geographically-separated databases. Furthermore, if the VO's advocates are correct in assuming that the data mining of multiwavelength datasets will rapidly increase in importance once the VO makes their integration possible, then associations made between a pair of popular datasets by one researcher are likely to be of interest to many subsequent researchers. This implies not only that the associations themselves should be stored in some form, but also that the method used to make them should be recorded, so that later users can assess whether the procedure used is adequate for their particular purpose.

So, the VO association problem may be broken up into three constituent problems:

  • how best to make associations between astronomical objects
  • how to implement efficiently the association of objects in distributed databases
  • how to record the associations made and the methods used to make them so that subsequent users can make use of that information

Of course, these three questions are not independent: the best association algorithm, astrophysically-speaking, may be very expensive computationally when run on large datasets, and also difficult to describe in simple metadata. Despite that, we shall take these three questions in turn here.

How best to make associations between astronomical objects

Association by proximity

Astronomical objects have a natural index - position in the sky - so one might think that the key to making associations between astronomical objects was the use of spatial indexing techniques to match sources in different catalogues by proximity. As Clive Page's document Indexing the Sky makes clear, this seemingly simple operation is not so easy, for a number of reasons. Firstly, the celestial sphere is a two-dimensional surface, and a curved one at that, both of which make sky indexing non-trivial given the 1D (e.g. B-tree) indexing schemes typically used by database management systems. Solutions to this problem exist, though. For example, there are two schemes well known in astronomy for pixelising the sphere and labelling the pixels in a manner well suited to 1D indexing: HEALPix (for Hierarchical Equal Area Isolatitude Pixelisation), developed by Kris Gorski and collaborators, and the HTM (for Hierarchical Triangular Mesh) scheme, from the SDSS database group at Johns Hopkins. The simple geometric origin of both these pixelisations means that it is relatively simple to compute the set of pixels (at a given resolution level) that intersect a given region of the celestial sphere, and their hierarchical nature means that it is then possible to look for entries in a database table lying within those pixels at the appropriate resolution scale. Secondly, astronomical positions have uncertainties, so it is necessary to make a fuzzy join, matching pairs of sources within a certain distance of each other, rather than making an exact positional match, which would be an easier operation for a relational database. As Page describes, B-tree indexes using pixelisation schemes such as HTM or HEALPix can produce fairly efficient matching of near neighbours, even for large databases, using his PCODE algorithm.
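
To make the idea of a 1D key for a 2D sky position concrete, the following is a minimal sketch in Python. It deliberately uses a naive equal-angle grid rather than HEALPix or HTM, and it is not Page's PCODE algorithm; the cell size and the names cell_key, keys_covering, build_index and candidates_near are all assumptions made for illustration. It shows only the two steps described above: labelling each position with a single integer, and listing the cells that could contain neighbours of a given position.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0 / 60.0                 # 1 arcmin cells (arbitrary choice for this sketch)
N_RA = int(round(360.0 / CELL_DEG))   # number of cells around the RA axis
N_DEC = int(round(180.0 / CELL_DEG))  # number of declination bands


def cell_key(ra_deg, dec_deg):
    """Map a sky position to one integer, usable as a key in a 1D (B-tree style) index."""
    ra_cell = int((ra_deg % 360.0) / CELL_DEG) % N_RA
    dec_cell = min(int((dec_deg + 90.0) / CELL_DEG), N_DEC - 1)
    return dec_cell * N_RA + ra_cell


def keys_covering(ra_deg, dec_deg, radius_deg):
    """Return the keys of every cell that could hold a point within radius_deg."""
    dec_lo = max(-90.0, dec_deg - radius_deg)
    dec_hi = min(90.0, dec_deg + radius_deg)
    band_lo = min(int((dec_lo + 90.0) / CELL_DEG), N_DEC - 1)
    band_hi = min(int((dec_hi + 90.0) / CELL_DEG), N_DEC - 1)
    # The RA half-width is padded by 1/cos(dec) at the declination furthest
    # from the equator; close to a pole we simply take every RA cell.
    cos_dec = math.cos(math.radians(max(abs(dec_lo), abs(dec_hi))))
    if cos_dec * 180.0 <= radius_deg:
        ra_cells = range(N_RA)
    else:
        half_width = radius_deg / cos_dec
        first = math.floor((ra_deg - half_width) / CELL_DEG)
        last = math.floor((ra_deg + half_width) / CELL_DEG)
        ra_cells = [c % N_RA for c in range(first, last + 1)]  # wraps at RA = 0
    return {band * N_RA + c for band in range(band_lo, band_hi + 1) for c in ra_cells}


def build_index(catalogue):
    """catalogue: iterable of (obj_id, ra_deg, dec_deg) rows. Returns key -> rows."""
    index = defaultdict(list)
    for row in catalogue:
        index[cell_key(row[1], row[2])].append(row)
    return index


def candidates_near(index, ra_deg, dec_deg, radius_deg):
    """Coarse, cell-level candidate list; an exact distance filter must follow."""
    found = []
    for key in keys_covering(ra_deg, dec_deg, radius_deg):
        found.extend(index.get(key, []))
    return found
```

A grid like this has cells whose area shrinks towards the poles, which is one of the shortcomings that equal-area hierarchical schemes such as HEALPix and HTM are designed to avoid.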

Finally, the results of such a pixel-based neighbour search have to be filtered, by great circle distance, to complete the fuzzy join, and the trigonometry of the great circle distance formulae is somewhat complicated to write in SQL - certainly something to be implemented as a user-defined function, rather than left to the individual user to type in. Furthermore, vendor-specific deviations from the SQL92 standard mean that the implementation of such formulae will vary between different database management systems. (N.B. This discussion assumes that the object and source catalogues have consistent astrometry, which will not always be the case in the VO context, at least initially; part of the procedure of association by proximity might involve using one dataset to improve the astrometry of the other - for example, using high resolution radio data tied to the ICRF to tweak the astrometric reference frame of an optical catalogue.)
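
As an illustration of the kind of user-defined function meant here, the sketch below computes the great circle separation with the haversine formula and uses it to filter a candidate list. It is written in Python rather than in any particular SQL dialect, and the function names and the (id, RA, Dec) tuple layout are assumptions made for the example.

```python
import math


def great_circle_sep_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Angular separation in arcsec between two positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    h = min(1.0, max(0.0, h))  # guard against rounding just outside [0, 1]
    return math.degrees(2.0 * math.asin(math.sqrt(h))) * 3600.0


def filter_by_separation(source, candidates, max_sep_arcsec):
    """Complete the fuzzy join: keep candidates within max_sep_arcsec of the source.

    source and each candidate are (id, ra_deg, dec_deg) tuples.
    Returns (candidate_id, separation_arcsec) pairs, nearest first.
    """
    _, sra, sdec = source
    kept = []
    for cid, cra, cdec in candidates:
        sep = great_circle_sep_arcsec(sra, sdec, cra, cdec)
        if sep <= max_sep_arcsec:
            kept.append((cid, sep))
    return sorted(kept, key=lambda pair: pair[1])
```

The haversine form is used because it stays numerically well behaved at the small separations typical of cross-matching, where the plain spherical cosine formula loses precision.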

Why association by proximity is often not enough

In many of the most interesting astronomical applications, association by proximity is not adequate, due, in some sense, to the "fuzziness" of the fuzzy join involved. The recent preprint by Dunlop et al., which discusses the quest to identify the source HDF850.1, neatly illustrates this point. When Hughes et al. (1998) discovered HDF850.1, using the SCUBA instrument on the James Clerk Maxwell Telescope (JCMT), they sought an optical counterpart in the Hubble Deep Field data of Williams et al. (1996), the deepest optical data in existence: HDF850.1, and the other sources detected by SCUBA in the HDF, were expected to be very faint optically, so the deepest possible optical data were required for the association procedure. The angular resolution of the SCUBA data is relatively poor - the FWHM of the beam is about 15 arcsec - which means that, with optical data as deep as those in the HDF, there will be, on average, several optical objects lying within the positional error ellipse of each SCUBA source. In this case, positional proximity is not enough to determine which of these is most likely to be the true optical counterpart. This SCUBA-HDF example may be a little extreme, but there are many situations in which the positional information is not good enough to yield an unambiguous identification by itself, and astrophysical information about the objects involved has to be folded into the association procedure.

Association methods using astrophysical information

There exist a number of methods for incorporating astrophysical information into association procedures. The well known techniques are all probabilistic, and all boil down to looking at all possible candidates (e.g. all objects within some maximum search radius of the source, whose value is set by the estimated uncertainties in the source and object positions) and assigning to each a probability that its configuration (its distance from the source position, given its astrophysical properties) could occur by chance (i.e. with that object not being the true counterpart of the source). The favoured association is then the object with the lowest probability - i.e. that with a configuration least likely to have arisen by chance.

Perhaps the simplest of these prescriptions is the Poisson probability model (e.g. Downes et al. (1986)), which, applied to our SCUBA-HDF example, would run as follows. For each SCUBA source, find all optical galaxies out to some maximum search radius and note, for each of them, its distance, d, from the source and its optical magnitude, m. Then, for each of them, estimate the probability, P0 = 1 - exp(-pi * d^2 * N(<m)), where N(<m) is the surface number density of objects in the HDF catalogue at least as bright as the particular object. The probability, P, used to select the most likely optical counterpart is then computed from P0 using a factor (depending on the limiting magnitude of the optical catalogue and the maximum search radius used) which corrects for the fact that other combinations of d and m could have produced the same value of P0.
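
The first step of this prescription can be sketched as follows. The code computes only P0; the correction factor that converts P0 into P is not reproduced, since its form is not given in this note. The function names, the candidate tuple layout and the idea of passing N(<m) as a callable are assumptions made for the example.

```python
import math


def poisson_chance_probability(d_arcsec, n_brighter_per_arcsec2):
    """P0 = 1 - exp(-pi * d^2 * N(<m)) for one candidate.

    d_arcsec: distance of the candidate from the source position.
    n_brighter_per_arcsec2: surface density N(<m) of objects at least as
    bright as the candidate, in the same angular units as d.
    """
    expected = math.pi * d_arcsec ** 2 * n_brighter_per_arcsec2
    return -math.expm1(-expected)  # 1 - exp(-x), accurate for small x


def rank_candidates(candidates, surface_density):
    """Rank candidates by P0, least likely chance alignment first.

    candidates: list of (obj_id, d_arcsec, magnitude) tuples.
    surface_density: callable m -> N(<m) per square arcsec, which would
    normally be measured from the optical catalogue itself.
    """
    scored = [
        (obj_id, poisson_chance_probability(d, surface_density(m)))
        for obj_id, d, m in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1])
```

Note that a bright object some way from the source can rank above a faint object that is closer, because bright objects are rarer and so less likely to lie nearby by chance.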

The Poisson probability method has the benefit of simplicity, but more sophisticated prescriptions can also include information about the source population. Most simply, if the source can be classified according to some property, then only objects of that class need be included in the association procedure: so, in our SCUBA-HDF example, we could have excluded the few stars in the optical HDF catalogue, on the assumption that our SCUBA source could not be a star. More sophisticated algorithms can take account of the known properties of the source population. For example, the Likelihood Ratio method (e.g. Sutherland and Saunders (1992)) selects as the most favoured association from a set of candidate counterparts that with the largest value of a likelihood ratio, which is defined to be the ratio of the infinitesimal probability of finding the true counterpart to the source at the position of the object and with its flux to the infinitesimal probability of an object with that flux being found there by chance. Clearly, this approach can work equally well for more or different quantities than flux - for example, an extragalactic survey might include photometric redshift estimates of optical galaxies, and a Galactic survey might include velocity information of particular objects. In the Sutherland and Saunders (1992) formulation, a model for the flux distribution of the true counterparts is employed, while, in the more recent variant presented by Rutledge et al. (2000), the analogous information is estimated empirically, by studying "on-source" and "background" fields, in a manner similar to that employed by Mann et al. (1997) and Mann et al. (2002).
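
A likelihood ratio of this kind is commonly written as LR = q(m) f(r) / n(m), where f(r) is the probability density of the positional offset r, q(m) is the expected magnitude distribution of true counterparts and n(m) is the surface density of background objects per unit magnitude. The sketch below assumes a circular Gaussian for f(r) and takes q and n as user-supplied callables; those choices, and all the function names, are assumptions made for illustration rather than a transcription of any of the papers cited.

```python
import math


def positional_pdf(r_arcsec, sigma_arcsec):
    """Circular 2D Gaussian density f(r) of the source-object offset, per arcsec^2."""
    return math.exp(-0.5 * (r_arcsec / sigma_arcsec) ** 2) / (2.0 * math.pi * sigma_arcsec ** 2)


def likelihood_ratio(r_arcsec, mag, sigma_arcsec, q_of_m, n_of_m):
    """LR = q(m) * f(r) / n(m).

    q_of_m: callable giving the magnitude distribution of true counterparts.
    n_of_m: callable giving the background surface density per unit
            magnitude (per arcsec^2), assumed to be non-zero at mag.
    """
    return q_of_m(mag) * positional_pdf(r_arcsec, sigma_arcsec) / n_of_m(mag)


def best_counterpart(candidates, sigma_arcsec, q_of_m, n_of_m):
    """Return (obj_id, LR) for the candidate with the largest likelihood ratio.

    candidates: list of (obj_id, r_arcsec, magnitude) tuples; None if empty.
    """
    scored = [
        (obj_id, likelihood_ratio(r, m, sigma_arcsec, q_of_m, n_of_m))
        for obj_id, r, m in candidates
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None
```

In the model-based formulation q_of_m would come from an assumed flux distribution, whereas in the empirical variant it would be estimated from the difference between on-source and background fields; the selection step is the same in either case.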

How to implement efficiently the association of objects in distributed databases

The integration of data from distributed databases remains an active research topic in computer science, so it is not surprising that no general solution has been implemented within astronomy yet. The complication of the inclusion of astrophysical information means that those partial solutions which have been prototyped have been restricted to associations by proximity alone. The most noteworthy of these is the SkyQuery system developed by the Johns Hopkins group. This prototype uses web services to enable the spatial matching of objects in SQL Server implementations of sky survey catalogues at three different locations: Sloan EDR at Fermilab, 2MASS at Caltech and a copy of FIRST loaded into SQL Server at JHU. (Note that the actual cross-match algorithm is implemented as a stored procedure that can run on all three SQL Server instances; this introduces a restriction that SkyQuery, at least in its current form, is not readily applicable to catalogues stored in other DBMSs.)

SkyQuery seeks to avoid problems with moving vast amounts of data over the network by working out a query execution plan that moves data between the three databases in the order that minimises such traffic. The correct execution plan is worked out by querying each database in turn to find out how many entries each has that might satisfy the user's query. This also has the effect of bringing those objects into cache, so that the execution of the query does not require them to be read from disk for a second time.
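
The planning step can be pictured with the short sketch below. It is not SkyQuery's code: the rule of visiting the archive with the fewest candidate rows first, so that the smallest partial result is the one shipped onwards, is an assumption about how such a plan might be ordered, and count_candidates stands in for whatever count query each archive would actually run.

```python
def plan_execution_order(archives, count_candidates, query):
    """Order archives so that the one with the fewest candidate rows is queried first.

    archives: list of archive names.
    count_candidates: callable (archive_name, query) -> int, a stand-in for
        a count query sent to each archive (hypothetical interface).
    query: opaque description of the user's query, passed through unchanged.
    Returns (ordered_archive_names, counts_by_archive).
    """
    counts = {name: count_candidates(name, query) for name in archives}
    ordered = sorted(archives, key=lambda name: counts[name])
    return ordered, counts


def estimated_rows_shipped(ordered, counts):
    """Crude upper bound on rows sent over the network for a chained match.

    Each hop ships at most the smallest count seen so far, since a
    cross-match cannot return more rows than its smaller input
    (assuming at most one match per row).
    """
    shipped = 0
    smallest = None
    for name in ordered[:-1]:
        smallest = counts[name] if smallest is None else min(smallest, counts[name])
        shipped += smallest
    return shipped
```

Comparing estimated_rows_shipped for different orderings shows why the count queries are worth their cost: starting from the largest archive would ship its full candidate list on the first hop.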

It is not clear whether this neat recursive procedure, handing data up from database to database, would be possible with a more sophisticated association prescription - e.g. the likelihood ratio technique - as its use may be dependent on the simplicity of the purely positional case, which is symmetric between the "object" and "source" catalogues. The more general case may be more complicated, but it is not clear that even it would require the shipping of large quantities of data across the network. It is more likely that it will be a combination of an initial fuzzy join, which only requires the transfer of the positions of the sources to the object database, and the running of some stored procedure which implements the estimation of a likelihood ratio (or equivalent) for the much smaller number of objects falling within the initial search radius selected by the fuzzy join. This procedure may be greatly helped by pre-computing sets of "cross-neighbours" between pairs of popular databases - for example, producing a table that records all SDSS objects that are within 15 arcsec of each source in the SuperCOSMOS Sky Survey database. Such a table could form the basis of a wide range of possible association procedures, all of which start with an initial proximity search but which may do anything thereafter.
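
A cross-neighbours table and its use can be sketched as below. The table is built here by brute force over two small in-memory lists, which is only sensible for illustration; a real implementation would use an indexed fuzzy join inside the database. The function names, the tuple layouts and the pluggable score callable are assumptions made for the example.

```python
import math


def sep_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great circle separation in arcsec (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(math.sqrt(min(1.0, max(0.0, h))))) * 3600.0


def build_cross_neighbours(sources, objects, max_sep_arcsec=15.0):
    """Pre-compute (source_id, object_id, sep_arcsec) rows for all close pairs.

    sources and objects are lists of (id, ra_deg, dec_deg) tuples.
    """
    rows = []
    for sid, sra, sdec in sources:
        for oid, ora, odec in objects:
            sep = sep_arcsec(sra, sdec, ora, odec)
            if sep <= max_sep_arcsec:
                rows.append((sid, oid, sep))
    return rows


def associate(rows, score):
    """For each source, pick the neighbour with the highest score.

    score: callable (object_id, sep_arcsec) -> float. It may use only the
    separation (pure proximity) or look up astrophysical properties of the
    object, e.g. to evaluate a likelihood ratio.
    Returns {source_id: (object_id, score)}.
    """
    best = {}
    for sid, oid, sep in rows:
        value = score(oid, sep)
        if sid not in best or value > best[sid][1]:
            best[sid] = (oid, value)
    return best
```

Because the table stores only identifiers and separations, the same rows can be reused with different score functions, which is the sense in which one proximity search can underpin many association procedures.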

How to record the associations made and the methods used to make them so that subsequent users can make use of that information

This third question is the one for which a solution is least apparent now. It is not usual for the results of association procedures to be stored in a publicly-accessible manner (other than as part of the results written up in scientific papers), nor is there a clear way to describe, say, the details of a likelihood ratio procedure using metadata that can be readily interrogated. This latter issue would seem to be part of a much wider topic of how to express the provenance of derived data, and so may be addressed within other areas of e-science.
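
Purely to indicate what a stored association might need to carry, the sketch below pairs each association with a description of the method that produced it and serialises the result as JSON. No such schema exists in AstroGrid; every class name, field name and example value here is hypothetical, and a real solution would have to follow whatever provenance conventions are eventually agreed.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AssociationMethod:
    """Hypothetical description of how an association was made."""
    name: str          # e.g. "likelihood_ratio" or "nearest_neighbour"
    parameters: dict   # e.g. {"search_radius_arcsec": 15.0}
    reference: str = ""  # free-text pointer to a description of the method


@dataclass
class AssociationRecord:
    """Hypothetical stored link between a source and its favoured counterpart."""
    source_catalogue: str
    source_id: str
    object_catalogue: str
    object_id: str
    score: float
    method: AssociationMethod


def to_json(record):
    """Serialise a record, including its nested method description."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text):
    """Rebuild a record from the JSON produced by to_json."""
    data = json.loads(text)
    data["method"] = AssociationMethod(**data["method"])
    return AssociationRecord(**data)
```

Even this toy record makes the difficulty plain: the parameters dictionary can name a search radius, but it cannot by itself capture, say, the model adopted for the flux distribution of true counterparts.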

Application to WP-A5.2

The first part of the WP-A5.2 association problem has already been addressed, under the auspices of WP-A4, where an X-ray/optical cross-match was performed using a ROSAT catalogue. This worked as expected, but the density of sources on the sky was insufficient to challenge the algorithm thoroughly. If WP-A5.2 is resurrected, when the first XMM-Newton source catalogue is available, it would be interesting to address some of these issues. For example, it would be illuminating to see how easy it is to write something like a likelihood ratio analysis as a stored procedure in a DBMS, and also to see how the creation of a cross-neighbours table can speed up the fuzzy join that would precede it. The latter is, in fact, also likely to be investigated by the resurrected optical/near-IR pilot of WP-A5.1, to facilitate the association of SDSS and SuperCOSMOS Sky Survey objects within the SDSS Early Data Release region.

-- BobMann - 14 Sep 2002
