Comments on the XMATCH function in ADQL
1. Cross-match function XMATCH
ADQL (Astronomical Data Query Language) contains a function called XMATCH designed to allow cross-matching of sources between catalogues. This is to be available in Full SkyNode but is not required in the Basic SkyNode.
I have not yet seen a detailed specification of the XMATCH function, but think I understand its working within the current implementation of SkyNode from various examples. It is used in WHERE clauses like this:
SELECT cols FROM table1 AS x, table2 AS y, table3 AS z WHERE XMATCH(x, y, !z) < 3
This means that searches for sources in catalogue
x which have a counterpart in catalogue
y but
no counterpart in catalogue
z (the exclamation mark is a negation operator), while the matching likelihood is equivalent to 3-sigma. This is a nice simple syntax, but it is not immediately evident to me whether cross-matching can be generally reduced to such simple terms, or whether other parameters are needed.
Let me start by raising what I think are some problems with this simple syntax.
1.1 Not quite a function
Although XMATCH has the appearance of a function returning a numerical value, actually it is not, and it requires the query parser to transform it into something else. It would be invalid, as far as I understand it, to specify something like:
WHERE XMATCH(x, y) < x.somecolumn
The significance value (3 in the first example) is, if my understanding is correct, an constant within the matching process and cannot be a field from one of the input tables.
I think that it would be much better to include the significance (or something equivalent) as a function argument; it could be included as the required first argument. This still allows for a variable number of table arguments to follow. The less-than function seems to be implied by the fact that you want to do a cross-match, and I cannot see that would make sense to allow other relational operators such as greater-than or equal-to here. I propose instead that XMATCH should be a true boolean function:
... WHERE XMATCH(someConfidenceLevel, x, y, !z)
SQL99 treats boolean as a first-class data type and I think most DBMS should be able to handle boolean functions. This should be at least as easy to parse as the existing notation.
1.2 Sigma value or confidence level
I am not convinced that a sigma value is the best way to express a matching probability since it is based on an assumption of normal errors. It is true that astronomers colloquially refer to 2.5-sigma results and the like, and many of us carry in our heads some rough equivlants between probability values and sigma, on the assumption that we are talking of errors which have a 1-dimensional Gaussian distribution (singlee-tail probability of 1e-3 being 3.09 sigma from the mean, etc.). Here, however, we are considering 2-dimensional error distributions. I confess that I do not carry the probability values of a given number of standard-deviations in my head for 2-d Gaussians, and I suspect the same may be true for many astronomers. In published work I think it is more common for such things to be expressed in probability terms, e.g. 90% error-boxes. I would propose, therefore, that if we need some parameter to specify the matching probability we go for a confidence level (false-alarm probability) in the ADQL XMATCH function.
1.3 No outer join
I think we need a notation for the left outer join, that is for including in the output table the details of all rows in the first table, whether or not a match was found, since the absence of a counterpart is often of scientific interest. I am not sure what notation should be used: in keeping with the simplicity of the
!z notation, maybe a suffix of say a plus-sign would be sufficient to denote an outer-join, i.e.
... WHERE XMATCH(someConfidenceLevel, x, y+) ...
I vaguely recall that Oracle's dialect of SQL used to have a similar notation for outer joins.
2. Scientific Requirements
I can think of three fairly distinct situations where cross-matching is required, but look forward to getting other contributions.
2.1 Cross-match user's own list with standard catalogues
If the user's own list consists of just one source position, then the problem reduces to that of a cone-search, for which the REGION function in ADQL has already been invented. If there are just a few sources, perhaps they could be handled by a series of cone-searches, and the results merged (e.g. using the table merging functionality which has been prototyped in Astrogrid).
If the user's list is large, then I think the best way to support this is for the list to be uploaded to a server already holding the standard catalogue of interest, and ingested into its DBMS. This case then becomes essentially the same as section 2.2 below. This obviously depends upon the existence of
MySpace or some other mechanism which allows users to have space on the server, software to upload the dataset, convert it to a usable format, ingest it, create the necessary bounding-box column, etc. Such considerations are well beyond the scope of ADQL, but such facilities are important and must not be neglected by the VO.
There may be source lists of a size between these two cases, but at present I doubt if the effort of catering separately for such small to medium sized lists can be justified. For the moment I will assume that any list too large to be decomposed into a few cone-searches can be uploaded, so coming under section 2.2 below.
2.2 Cross-match one catalogue (or part of one) with another
If the two catalogues exist as tables within the same DBMS, then the spatial join functionality (extensively discussed in other wiki pages) will do this. The spatial join needs to find all sources which have error-regions which overlap. There is, in many cases, more to finding counterparts than this: there may be ambiguous or uncertain identifications which need other properties such as spectral type, proper-motion, or redshift to resolve. But all this requires specialist knowledge which only the astronomer can supply. The job of the cross-matching service is to come up with all possible cross-matches which are bound to include some false positives, and let the astronomer weed them subsequently.
The basic cross-matching process can be a substantial one: if our two catalogues
x and
y contain
Nx and
Ny sources, then in principle there are
Nx×Ny possible combinations to check. If each N can be up to 10
9, then this makes up to 10
18 possible combinations. In practice the problem becomes manageable only with spatial indexing. In database terms, the cross-match is a spatial join. If we want to include in the output cases in which a source in table
x has no counterpart in table
y then we need a
LEFT OUTER JOIN.
Since source positions in catalogues are always subject to uncertainty, each position in 2-dimensional coordinate space turns into a small error region. The join criterion is then the overlap of these error regions, which are typically ellipses.
2.3 Matching beyond error-regions
When seeking counterparts the overlap of the individual error regions is generally all that matters. But there are a few situations where one might want to cross-match with a larger permitted distance between objects. This arises particularly when searching for clusters of stars or galaxies, or for evidence of gravitational lensing. Recently a colleague suggested another case: he was interested in finding examples of a particular type of active galaxy where it was fairly close to a bright star, so that the latter could be used as a phase-reference source for a telescope with active optics. The permitted offsets in such searches may be from tens of arcseconds up to a few arcminutes, certainly much larger than the error-regions of the sources involved. In some cases when searching for clusters of objects one might only need to use one catalogue, i.e. carry out a spatial join of that table with itself. In that case you would certainly want to exclude from the output every match of a source with itself: you are interested only in finding its near neighbours.
3. Implementing a Spatial Join
A join is intrinsically an N
2 operation, so its efficient implementaton depends on indexing. It seems self-evident to me that when one wants to perform
spatial searches and joins on objects in a DBMS then one needs a DBMS which implements
spatial indexing. Fortunately, spatial indexing is included in PostgreSQL and MySQL, and IBM recently announced that the next version of DB2 will have it included (it was previously an add-on). For several other products such as Sybase-ASE and Oracle there is an optional package for spatial data handling. I think that the only well-known DBMS without a spatial data option is Microsoft SQL Server. Jim Gray and colleagues have gone to some lengths to try to establish that one can manage with only 1-dimensional indexing, either by using pixel-based methods such as HTM or HEALPix, or by dividing the sky up into numerous strips in declination. This is indeed feasible for simple spatial searches, such as the cone-search, see for example
There Goes the Neighborhood: Relational Algebra for Spatial Data Search. But it does seem rather complicated, and I note that they were able to perform a search efficiently only by getting HTM functions built into SQL Server 2005, which I presume won't be available to the rest of us for some time. No mention is made in that paper of performing a spatial join. To perform a spatial join efficeintly it needs to be carried out within the DBMS itself, which means its query language must include a spatial overlap operator or function.
For example, in Postgres there is a spatial overlap operator (
&&) which returns a boolean result; in MySQL there is a function called
MBRIntersects which returns an integer 1 or 0, and in DB2 there is a similar function called
ST_MBRIntersects. A spatial join can be implemented using a pixel-code index, but again it is considerably more complicated and slower. A suitable algorithm was outlined in
SkyIndexing, and the relative performance reported in this
cross-matching report.
A great many join algorithms have been proposed, and several of these have been used to implement a spatial join. The most common method, and probably the simplest, is the indexed loop join. The DBMS loops through all the rows of table one, and for each uses the index to look up all possible matching rows in table two, then outputing the results from each possible combination. This requires that a spatial index be created only for the second table. For the basic inner join the results do not depend on which table comes first, so it makes sense to have the smaller table as the one accessed sequentially, and the larger one accessed via its spatial index. For the most common outer join, the left outer join, the order does matter, as the output includes
all rows from table one, but only those with matches from table two. In a data archive containing a number of source catalogues as tables, therefore, it will make sense to create spatial indices on all of them, so that there are no artificial constraints on which can be joined with which.
The indexed loop join algorithm can, of course, be implemented in a script external to the DBMS: this allows it to be used, potentially at least, to join a table with one resident on some external system. I have tested this using Postgres and its external access package, dblink. The indexed loop join can be made to work, but I found that it runs nearly an order of magnitude more slowly, even on a local machine, than when performed internally by the DBMS. When performed over the wide-area network there is an additional speed penalty from the latency of the links. A preliminary discussion of the issues of performing joins over the network can be found at
DistributedCrossMatch.
There are a few other practical obstacles to implemeting a spatial join, even if a spatially-enabled DBMS is available.
3.1 Error region and confidence level
The size of the error region essentially depends on the confidence limit you want to use. Typical choices by astronomers are to extend the region out to the point at which the probability that the sources is actually in the error-region is 90%, or 99%, or even 99.9%. Smaller regions return fewer false positives, but lose some actual matches. The confidence level really needs to be chosen with knowledge of the likely false-alarm rate, which depends on the density of objects. This varies over the sky, especially with galactic latitude in many catalogues.
The shape of the error-region is typically an ellipse, since the error in the measured position is often estimated separately along two spatial axes. In some cases these are aligned with RA and Dec, and quoted as errors in these values; in others then the position angle of the ellipse with respect to the coordinate axes is also needed. The error ellipse rarely has a large eccentricity: inspection of several catalogues shows that the major axis is rarely more than twice the minor axis.
Spatially-enabled DBMS usually support a variety of simple geometric objects such as lines, circles, and polygons. For efficiency the simplest 2-d shape is that of a rectangular box: for more complex objects the system usually puts a box around them, and creates an R-tree (or similar index structure) around the set of boxes.
A spatial join is normally performed in two stages, called filter and refine:
- Find rows where the bounding boxes overlap (this will include some false positives);
- This set of rows is examined to see whether the actual error regions (ellipses or whatever) actually overlap, if not the row is discarded from the output.
This two-stage process seems to be the norm in spatial joins. A similar two-stage process is needed when using pixel-based methods (e.g. HTM or HEALPix): first one finds rows where the sets of pixels overlap, and then refines the query to select only the actual overlapping error regions.
Using SQL one can do this two-stage process explicitly with a query such as (using Postgres notation and assuming circular error regions for simplicity):
SELECT variousColumns FROM table1 AS x, table2 AS y
WHERE x.errbox && y.errbox
AND distance(x.ra, x.dec, y.ra, y.dec) < (x.errorRadius + y.errorRadius)
AND ...ww2
For elliptical regions the expression would be a good deal more complicated, but the steps are essentially the same.
Lines 2 and 3 could be generated easily enough from
XMATCH(x,y) provided that the
errorRadius parameters have been set up with suitable values. But the question is: does one size fit all? If we want to give the user of the XMATCH function the choice of confidence limit (3-sigma or 99% or whatever) then things are a bit more complicated. For a start, the match will not work using a confidence limit which selects regions larger than the bounding boxes. Secondly, we need to set up the columns containing the
errorRadius parameter (or equivalent) with a defined confidence level (1-sigma, or 67% or whatever); then it should be possible to have the query generator take the confidence level parameter and adjust the threshold in the refinement stage.
3.2 How to support matches beyond error-regions
Can we also support queries which seek clusters of objects, where positional offsets are much larger than the error-radius. One possibility is to decide upon an upper limit to the offsets (10 arcminutes perhaps?) and then create bounding boxes of this size, and indices to suit. The first stage of the spatial join will then come up with a much larger of potential matches, which will involve substantially more work by the refinement stage. Clearly this would work, but it would impose a speed penalty on the more common case of cross-matching within error-regions.
A way around this would be to have two box-type columns in each table, one surrounding just the error region, and another much larger to support clustering queries; a spatial index could be create on each.
As noted earlier, clustering queries will sometimes take place using just one catalogue, i.e. a self-join. The results then should not include the match of a source with itself, only with its neighbours. This requires a special condition on the spatial join. This could be implemented by having a unique identifier in a separate column and with a suitable join condition. In Postgres the syntax would be something like this:
SELECT cols FROM somecat AS a, somecat AS b
WHERE a.errbox && b.errbox
AND a.ident <> b.ident ...
At least I think so, I haven't tried it yet.
3.3 Proper motion and other problems
When comparing catalogues of stars observed at different epochs the proper motions may affect the positions. A few catalogues include proper motion estimates for many nearby stars (e.g. USNO-B) from which, in principle, corrected positions for the other catalogue or for some standard epoch can be computed. In practice it will be difficult to do this in a generic way, since proper motions are provided with many different physical units, with the SI unit for angular velocity (radians/second) never used.
3.4 Great-circle distance function
The user will often want to have the actual offset between the position in table1 and table2 as a column in the table of results. This can be left to the user, but the expression is complicated, and the simplest expression behaves badly with small angular offsets (the haversine formula solves this). I suggest that a great-circle-distance function be provided within ADQL. It could then be used something like this:
SELECT x.ra, x.decl, y.ra, y.decl, GCDIST(x.ra,x.decl,y.ra,y.decl), whatever
FROM xray AS x, optical AS y
WHERE XMATCH(x,y)
AND whatever-else ;
4. Non-match Queries
The aim here is to identify sources in table1 which have no counterpart in table2. There are certainly scientific problems where this is important, but it is a facility that will have to be used with care. The main problem is understanding the coverage of the second catalogue. Catalogues arising from systematic surveys should have fairly unform coverage of the sky (except perhaps within stated limits) but there are still holes, for example from things like diffraction spikes from bright stars, passage of spacecraft overhead during the exposure, cosmic-ray hits, etc. Many catalogues, however, arise from analysis of individual patches of sky, for example the 1XMM and INT-WFS catalogues in our Grid Data Warehouse samples. Here absence of a counterpart is the norm, as only a small part of the sky has yet been covered. Even within regions apparently covered, there may be small missing patches, e.g. arising from the gaps between CCD chips.
The Data Modelling group of the IVOA have been devising notations and mechanisms for specifying sky coverage, but I think it will be some time before these turn into operational systems.
Implementation within a DBMS may require some experimenting to determine the best algorithm. The most obvious way is to do first a LEFT OUTER JOIN and then separately an INNER JOIN and then subtract the second set from the first. There may well be better solutions.
5. Distributed Cross-match
As far as the user is concerned, if the VO achieves its aims, it should not matter whether the catalogues to be cross-matched are local or remote. In practice joins across the network are difficult to perform, and much slower. Some thoughts on the subject have been set out in
DistributedCrossMatch. The problem becomes even more complicated if we need to support clustering queries and non-match queries as in sectons 3 and 4 above over the wide-area network.
Appendix: Notes on positional errors in catalogues
In order to work out what a user would be likely to want in a cross-matching service, I checked the catalogues currently on hydra/cass123 to see how their positional errors were specified. They turn out to be rather diverse.
- 1XMM: error region assumed to be circular, with column
radec_err giving 1-sigma error in arcseconds. The documentation suggests using a 3-sigma error radius for 99.97% confidence. Cross-matches with other catalogues are performed with a 15 arcsecond radius, but only some matches with an offset beyond 6 arcseconds are accepted; the error-radius varies a lot with source flux, up to tens of arc-seconds in a few cases.
- FIRST: error region is an ellipse, but the radius in arcseconds along each axis of the ellipse is derived from a formula involving several other columns, the position angle is given in column
pos_posang. The FIRST radii, with 90% confidence, of the error ellipse are given by the following formulae:
error_maj = extension_fwhm_maj_fit * (0.05 + phot_flux_error / phot_flux_peak) + 0.05))
error_min = extension_fwhm_min_fit * (0.05 + phot_flux_error / phot_flux_peak) + 0.05))
- INT-WFS: error region is an ellipse. Column
gaussian_sigma gives the 1-sigma error in arc-seconds, ellipticity gives the ellipticity of the ellipse, and position_angle gives its position angle in degrees.
- 2MASS: error region is an ellipse, columns
err_maj and err_min give the 1-sigma errors along the major and minor axes of the ellipse in arcseconds, with column err_ang giving the position angle in degrees.
- USNO-B: error region is an ellipse, but aligned with the coordinate axes. columns
sra and sdec give 1-sigma errors along the RA and declination axes in milli-arcseconds (in the original distribution). Errors are typically 0.1 to 0.5 arcseconds.
--
ClivePage - 14 May 2004
non spatial joins
what are the issues with non-spatial joins? If i wanted, say, to find objects which were 'red' (from opt/ir colours) and apparently 'fast moving' from proper motion measurements.
--
NicholasWalton - 15 May 2004
Well that is quite similar to the requirements of the Brown Dwarf search, analysed in BrownDwarfSearch and BrownDwarfQuestions. Assuming you start with two tables, one with opt/ir colours, and the other with proper motions. You join these to find counterparts, so for each star you now know both opt/ir colours and proper motions. Then you select the objects of the required redness and speed. You can reduce the effort involved by selecting relevant subsets before the join, e.g.
- Select from spectral table e.g. twomass, only stars with the required redness
- Select from the astrometric table, e.g. usno, only stars with high enough proper motions
- Do a spatial join on these two subsets.
In principle these can both be done in one SQL statement, you specify the result (as usual with SQL) and hope that the DBMS has a good enough optimiser to do it the "right way", i.e. filter first, join second. For example
SELECT somecolumns FROM twomass AS t, usno AS u
WHERE t.opt - t.ir > 2 -- selects red objects in twomass
AND SQRT(u.pmRA*u.pmRA + u.pmDEC*u.pmDEC) > 50 -- selects fast objects in usno
AND XMATCH(t,u); -- cross-matches on position
What would be more difficult would be if the condition involved colour and motion in some more complex formula, e.g. you wanted to select in some elliptical region in colour-propermotion space. That could only be done after the join had taken place; I expect that a good DBMS would do this ok, but would be much slower, as the join would have to be done first and the selection(s) second.
--
ClivePage - 20 May 2004
Comments by Patricio Ortiz
About proper motions... One interesting application of the VO should be the capacity to discover fast moving object by comparing two or more catalogues, and that implies being able to find the no-matches in several tables. Other piece of software may pick up the results and try to figure out which objects correspond to a moving object (in whatever time-scale).
Output of the GCDIST is desirable for sure, but it is also desirable to obtain the delta.RA delta.Dec to recognize features triggered by diffraction spikes for instance or to study the spatial distribution of the matches in search for any systematics.
Regarding Nic's comment I think we need to think on how to apply "filters" to our queries. In a catalogue which I want to self-correlate, I may be interested in just the blue objects as my target list, but I'd like to find matches within the list of "red" objects (colours as defined by some restriction in B-V for instance).
The same applies to merging two or more catalogues where one would like to impose constraints to select the targets and the neighbours.
If one uses where for the cross match, can one still use it to filter other properties?
--
PatricioOrtiz - May 17, 20040830
Yes, as shown above. But this leaves SQL to decide on the order of things, you just tell it the conditions you want on the output. Usually that's ok.
--
ClivePage - 20 May 2004