Data Centre Spatial Indexing
Author: Clive Page, 2004 Jan 20.
Last revised: 2004 Feb 25.
1. Introduction
This document discusses how to provide existing astronomical data centres with efficient means to perform cone-searches and catalogue cross-matches. Many data centres store a wide variety of data products, but this document only discusses tabular datasets where celestial position is a key search criterion, i.e. source catalogues and observing logs.
There are two main services that data centres will want to make available to users of their source catalogues: cone searches and catalogue cross-matching. There are also two main ways in which the database can support this: either by using true spatial index such as an R-tree, or else a mapping of the sky to a set of pixel-codes (such as HTM, HEALPix, or Qbox).
2. Spatial Index or Pixel-code Method
A true spatial index may seem to be an obvious solution, but there are some problems to be overcome:
- Not all DBMS support spatial indices. But, as shown in the table below, most of them do.
- The need for an error-box column adds at least 32 bytes/row to the width of a table, and also requires a voluminous index. But the spatial join using a pixel-code requires additional tables to be created, using almost as much space.
- It takes a long time to populate the error-box column and create the index, a matter of days for large tables. But this only has to be done once (if you get it right first time).
- The R-tree has no guarantee of average or worse-case performance, unlike its one-dimensional analogue the B-tree. If the data are seriously skewed, the R-tree is known to perform poorly. In practice, however, most astronomical tables have positions fairly evenly distributed, and R-trees seem to work well.
- A R-tree built on (RA,DEC) does not solve the problems arising from the singularities at the poles and the wrap-around of coordinates at the zero-point of RA. There are work arounds for this: the error-boxes for sources near the poles can be stretched suitably along the RA axis, and it is possible to avoid ill-effects from the wrap-around problem by entering sources very near 0° or 360° of RA in the table twice, with error-box coordinates suitably modified.
- The R-tree selects rectangular regions, whereas the most usual search is for a circle (cone) or ellipse. This problem affects all spatial access methods, however, and the subsequent filtering is quite efficient.
The table below lists the main database management systems (DBMS) in current service noting whether they have any spatial indexing built in or as option:
| DBMS |
Spatial Index |
Comments |
| DB2 (IBM) |
Yes |
Optional spatial extender uses multi-level grid file |
| Firebird |
No |
|
| Informix (IBM) |
Yes |
Built-in, uses R-trees |
| Ingres |
Yes |
Option: uses R-trees |
| MySQL |
Yes |
Built-in to latest release, uses R-tree: not yet mature. |
| Oracle |
Yes |
Several options, including R-trees. |
| PostgreSQL |
Yes |
Built-in R-trees, optional pgSphere using GiST supports spherical polars |
| SQL Server (Microsoft) |
No |
|
| Sybase ASE |
Yes |
Optional, uses R-trees |
| Sybase IQ |
No |
|
| WCStools |
No |
|
Notes:
- WCStools was written by astronomers at the Harvard-Smithsonian Center for Astrophysics and is used by a number of current data centres. The larger tables are stored in declination strips with each strip sorted in RA, which allows cone searches at reasonable speed.
- Sybase-IQ is a data mining product based on column-oriented storage and advanced indexing methods but this does not include multi-dimensional indexing. I know of only one astronomical data centre using it at present, for a specialised application.
- Firebird is the open source release of Borland's Interbase: I am not aware of any use by astronomical data centres.
All of the remaining products, except one, have spatial indexing facilities built-in or available as an add-on. The exception is Microsoft's SQL Server, which has not been used widely, perhaps because it only runs on the Windows operating system.
3. Cone Search
A cone search is the general term for a search for sources in the catalogue around a given celestial position usually expressed as Right Ascension (RA) and Declination (DEC). Most often the area of interest is a circle on the celestial sphere, i.e. a cone in three dimensions, but the same database techniques are involved for rectangular boxes or other shapes. For simplicity the term
cone search will be used for all of these.
A cone search service is supported by many data centres in some form or other at present, but it is interesting to find that none of the the "Top Ten" science problems in the
AstroGrid list appears to involve a cone-search at all. Cone searches are also carried out on observing logs in order to find observational datasets that can be downloaded for later analysis.
An Astronomical Data Query Language (ADQL) has been proposed with a syntax for cone-search something like this example:
SELECT * FROM table WHERE REGION('CIRCLE J2000 123.45 12.34 0.01');
This is derived from ADQL-8 in
IVOA ADQL Version 0.7 dated 2004-01-07. There are a number of alternatives to CIRCLE, I think, but this is the one which corresponds to the good old cone-search; the J2000 element specifies the equinox, but we have no way of specifying or using that at present so it will have to be ignored; the other three strings are (I assume) the RA, DEC, and RADIUS in degrees. The syntax of this obviously has to be translated into SQL that each data centre DBMS can understand; some or perhaps all of this translation is highly DBMS-specific, and will have to be carried out by the code on the data access layer at the data centre.
3.1 Cone Search using a Spatial Index
3.1.1 Postgres
Postgres R-trees are built on rectangles, specified by giving the (x,y) coordinates of two opposite corners, so it is necessary to add an additional column, say
errbox to hold these values, and then create an R-tree index on this column. When this is done, the
&& operator can be used to select boxes which overalp a given box:
SELECT * FROM table WHERE errbox && box(point(123.44,12.33),point(123.46,12.35))
If what you want is actually a circular region, then it is necessary to filter these results with a little trigonometry, e.g.:
SELECT * FROM table WHERE errbox && box(point(123.44,12.33),point(123.46,12.35)) AND
DEGREES(2 * ASIN(SQRT(POW(SIN(RADIANS(decl-12.34)/2),2) +
COS(RADIANS(-12.34)) * COS(RADIANS(decl)) * POW(SIN(RADIANS(ra-123.45)/2),2)))) < 0.01 ;
This works well in practice, returning results in 30 milliseconds for circles of modest size.
3.1.2 MySQL
With MySQL spatial indexing works on polygons; the simplest thing seems to be to use a box with 4 sides: this as to be specified as 5 points with number 1 and 5 the same to make it a closed polygon (there is no other way to specify closure). The syntax for the polygon constructor is rather messy with alternating spaces and commas as delimiters, and the nested pairs of parentheses also seem to be necessary ((I tried without, and it did not work)), so the same selection as above would be expressed as:
SELECT * FROM table WHERE MBRIntersects(errbox,
PolygonFromText('polygon((123.44 12.33, 123.46 12.33, 123.46 12.35, 123.44 12.35, 123.44 12.33))') AND
DEGREES(2 * ASIN(SQRT(POWER(SIN(RADIANS(decl-12.34)/2),2) +
COS(RADIANS(-12.34)) * COS(RADIANS(decl)) * POWER(SIN(RADIANS(ra-123.45)/2),2)))) < 0.01 ;
3.1.3 DB2
DB2 also uses polygons, but I had to get help from an IBM expert to construct this expression, having failed several times to work out the correct syntax. This converts text to numbers and then back to text, before concatenating the various strings, which seems very cumbersome. The expert also pointed out that the query optimiser only uses an index if one uses "= 1" in line 6 of the expression (below), not "> 0", although one might think they were equivalent, given a function which can only return 0 or 1. This is because DB2 does not yet support the Boolean type, only introduced into SQL in 1999.
SELECT * FROM table WHERE ST_Intersects(errbox,db2gse.st_polygon(
'polygon(('||char(123.44)||' '||char(12.33)||','||
char(123.44)||' '||char(12.35)||','||
char(123.46)||' '||char(12.35)||','||
char(123,46)||' '||char(12.33)||','||
char(123.44)||' '||char(12.33)||'))', 0) ) = 1 AND
DEGREES(2 * ASIN(SQRT(POWER(SIN(RADIANS(decl-12.34)/2),2) +
COS(RADIANS(-12.34)) * COS(RADIANS(decl)) * POWER(SIN(RADIANS(ra-123.45)/2),2)))) < 0.01 ;
3.1.4 Other DBMS with spatial indexing
I have not yet tried the spatial indexing systems of other DBMS, but feel safe in predicting that each one will have its own idiosyncratic syntax. I will be pleased to document them here as information arrives.
3.2 Cone Search without a spatial index
At present the majority of existing astronomical data archive sites, as far as I can discover, do not use spatial indexing but implement their cone searches by indexing on one spatial axis only using a B-tree (or in the case of WCStools using a sorted strip of data). As noted below, this is much slower than access via a spatial index but is good enough for tables of modest size. The axis indexed is usually declination, as this does not suffer from the wrap-around problem at 0/360° as does RA. If one writes a query like that below, which selects a range on each axis and then ANDs this with another expression which selects only the points in the required circle, then it does not actually matter which axis is indexed, as the optimiser can handle either case:
SELECT * FROM table WHERE
ra BETWEEN 123.44 AND 123.46
AND decl BETWEEN 12.33 AND 12.35 AND
DEGREES(2 * ASIN(SQRT(POWER(SIN(RADIANS(decl-12.34)/2),2) +
COS(RADIANS(-12.34)) * COS(RADIANS(decl)) *
POWER(SIN(RADIANS(ra-123.45)/2),2)))) < 0.01 ;
This will also work, albeit much more slowly, on a table with neither axis indexed. It is important to note that even if there are indices on
both axes this provides no extra speed, except that a good optimiser will use the index providing the greater selectivity. The query given above should, at least, work on all DBMS which call their raise-to-a-power function POWER and not POW. This is also the name specified by JDBC, so should be fairly portable.
A performance comparison was shown in
DataFederationandDataMining section 6.1 - for a table of the size of 2MASS, compared to a simple B-tree index on declination, an R-tree index speeds things up by about 130 times; I have since discovered that one can get even better performance in Postgres using a clustered index: this re-orders the data in the table so that adjacent areas in the index (i.e. on the sky) are stored in contiguous areas of disc as far as possible.
3.2.1 Multi-column Indexing
Some DBMS allow the creation of a B-tree index on a combination of columns, or an expression involving two or more columns. It is not possible to create such an index on two columns of floating-point data, such as RA and declination, or if it is the index is pretty useless, since it would only work for exact match. Since it is unwise to test for exact matches with floating-point numbers, the usual way of using an index in this case is to use a range, e.g. using the BETWEEN AND syntax of SQL, or < and > operators. This will not work if the index is created on a combination of RA,DECL in the same way that sorting on such a combination would not be usable. A possible work around is to convert one coordinate to a scaled integer. Suppose we convert declination to the nearest integer, then it would be possible to create an index on a function such as "FLOOR(DECL),RA". This could be searched for an exact value of FLOOR(DECL), and a range of RA. I have not tried this, but I believe it is the method used by the SDSS system at JHU with SQL Server. The choice of a suitable integerisation is important: if the strips are too wide then each search brings in too many unwanted rows; if too narrow, then a high proportionof queries will involve searching a number of separate declination strips, which would reduce efficiency.
Recent versions of Postgres will, apparently, allow the creation of an index on an expression, whereas my MySQL manual suggests that it does not. This technique may therefore have limited applicability.
3.2.2 Pixel-code Methods
Given that DBMS are good at handling integers (indeed not much good at handling anything but strings and integers), many people have come up with the idea of covering the sky with a grid of pixels, numbering them, and using a DBMS to search for the required pixel by number. Simple grids on a spherical-polar coordinate space do not work too well, because there are distortions of scale leading to infinitesimally small pixels at the poles, and a wrap-around problem at 0/360 degrees of Right Ascension.
Various ways have been devised to cover the sphere with pixels of roughly uniform size which avoid these problems.
- HTM: Around 1983 Geoffrey Dutton invented the Octahedral Quaternary Triangular Mesh O-QTM for use in Geographical Information Systems. This uses triangular pixels, which have been called trixels, which vary in size but not by much. Dutton traces the idea back to Buckminster Fuller's Dymaxion constructions and before that to Albrecht Dürer's work on map projections, but he must be given the credit for realising its value in accessing geographical data and for doing the mathematics. Paul Barrett of GSFC brought this to the attention of astronomers by presenting an ADASS paper comparing this and the Z-order index (a well-known coding produced by bit-interleaving the cartesian coordinates). The SLOAN Digital Sky Survey team at Johns Hopkins University based their Hierarchical Triangular Mesh (HTM) scheme on the same tiling of the sphere, and code for the projection has been written in C, C++, and Java.
- HEALpix was invented by Kris Górsky at TAC in Copenhagen, and later used at ESO for the all-sky maping of data from COBE and other missions. HEALPix uses pseudo-square pixels, of uniform size; there are two possible pixel numberings. Code for the projection has been writtein in IDL and Fortran90 (and I have translated some of it myself into the pl/pgSQL language used by Postgres).
- Qbox was invented by François Ochsenbein at CDS to access source catalogues stored in the Vizier database.
For most purposes, and maybe all purposes, it does not matter which scheme is used, one just needs a fairly uniform coverage of the sky with pixels (or trixels) and an algorithm to translate an arbitrary (RA,DEC) to an integer. To keep the following explanations generic I shall call these integer numbers
pixel-codes.
The way to use these pixel-code methods with a common-or-garden DBMS is fairly obvious:
- Choose a size for your pixels (more on this below) and acquire (or write) a function to convert (RA,DEC) to pixel-code.
- Add a new integer column, say
pcode to the table.
- Populate the column with the pixel-code values derived from the RA and DEC columns.
- Build a standard B-tree index on the
pcode column.
To use the pixel-code enabled table in a cone search is slightly more complicated, as in general one is not searching for a single code but for all those covering the area of the cone (or other region). The best way to find this set of integers is not entirely obvious: I wrote a routine which did the following
- Cover the selected region with a grid of points (RA,DEC) at spacings slightly smaller than the pixel width, so as to ensure that no pixel can be omitted by accident.
- Convert each (RA,DEC) to the pixel-code and accumulate the results in an array.
- Weed the array of pixel-codes to remove duplicates and thus produce a set.
Because the pixels are numbered with nearby pixels having similar (if not always adjacent) numbers, the resulting set contains many sequences of integers, and it can be converted into a form which is more tractable for an SQL
SELECT statement. Here is a simple example, which arises (using HEALPix) with a cone specified as (123, 34, 0.01):
SELECT * FROM table WHERE
pcode BETWEEN 177466298 AND 177466300 AND
pcode BETWEEN 177499067 AND 177499068 AND
pcode BETWEEN 177531835 AND 177531836 AND
DEGREES(2 * ASIN(SQRT(POWER(SIN(RADIANS(decl-34)/2),2) +
COS(RADIANS(-34)) * COS(RADIANS(decl)) *
POWER(SIN(RADIANS(ra-123)/2),2)))) < 0.01 ;
Note that you need an way of converting the triplet (RA,DEC,radius) to the set of pixel-codes external to the DBMS, and then converting the list of results into an SQL statement like that above. This, once you have the necessary software, works quite well for small cones, such as the 0.01° used here. But the number of pixels and even ranges starts to become unwieldly as the cone radius increases, as I found:
These figures were for one of the two numbering schemes of HEALPix using NSIDE=8192, which gives pixels around 25 arcseconds across; I expect the results would be similar, though not identical, for the other scheme and for HTM.
Clearly the number of lines in the
SELECT statement is liable to exceed the limits of the DBMS when the radius exceeds about 0.1° or so. This is a serious problem, as users may well want to find sources in an area corresponding to an existing image, or a large object. The research here into the Hyades cluster started with a cone search with a radius of 13°.
One possibility that I have not yet had time to explore is that the set of pixels corresponding to a given sky region could be generated within the DBMS as another table; then it would just be necessary to
JOIN this to the main table to find the sources in the region. This, however, turns a simple selection into a join, so it not likely to be especially fast.
3.2.3 Pixel Size
With all of the pixel-based methods one can chose the level of sub-division of the sky and thus the typical pixel size. If the pixels are too large they do not provide much selectivity; if too small, a high proportion of all cones (or regions of interest) cover a large number of pixels so the SQL expression becomes cumbersome and involves repeated index look-ups. It turns out that if one uses the finest sub-division of the sphere in which the pixel-code can be stored in a regular 32-bit integer, then the pixels are around 20 or 30 arcseconds across. This is quite convenient for most astronomical purposes, since most cone-searches are likely to be only a few arcseonds in diameter, extending perhaps to a fraction of a degree in a few cases. For searches for cones (circles) smaller than the pixel size, then most searches will only involve one pixel-code, but a few circles could cross boundaries and involve up to 4 (HEALPix) or 6 (HTM) pixels. If the circles are larger, then any number of pixels could be involved.
3.2.4 Pixel numbering
There have been attempts to devise a numbering scheme in which adjacent pixels always have nearly sequential integers. If it really were possible to number all sets of nearby pixels with nearby integers, then a small circle on the sky could always be transformed into a small range of integers, so that a single range search (possible with B-trees) would locate all of them. Unfortunately this is in general impossible: it is possible to arrange numbering so that contiguous pixels mostly have nearby integer codes, but there are always fault lines across which the numbers change radically. Range searches under these conditions fail badly. Experiments earlier with Z-order indexing, in which the RA and DEC coordinates are converted to scaled integers, and the bits interleaved, produces a reasonably good contiguity in most areas of the sky. And attempts to do range searches are good most of the time. But the unavoidable presence of fault lines makes the worse-case terrible, pulling the average performance down to rather poor. Further details are given in
SkyIndexing. This does not look a promising line to follow.
3.2.5 HTM or HEALPix?
The only published comparison of HTM and HEALpix that I have seen is that of Wil O'Mullane
http://www.sdss.jhu.edu/htm/doc/womullan_082000.pdf. He found that HEALPix algorithm was nearly an order of magnitude faster. The fact that HEALpix was coded in Fortran whereas HTM was in C (which compilers can generally optimise less well because of potential storage aliasing problems) can hardly account for such a difference. The HTM code is in fact considerably more complex and needs recursive function calls. In order to test the pixel-code method, I needed to be able to populate a pixel-code column in an existing table, create an index on it, and to produce suitable SQL queries. This meant I needed a celestial position to pixel-code conversion function callable from SQL. The easiest way to do this was to use Postgres and its procedural language pl/pgsql: in this I was able to convert the HEALPix code (kindly made available by Kris Górsky and colleagues) into a suitable function very easily. The procedural language of Oracle is said to be very similar to pl/pgsql, so this function may be reasonably transportable. I think it would have been an order of magnitude more difficult to have converted the HTM code in the same way: it is always possible to simulate recursion with iteration and a stack, but for this one needs arrays, which are not supported in many SQL extension languages.
Tests on a catalogue of fairly modest size, the FIRST catalogue of only 771078 rows, showed that the time for a cone-search was around 20 to 30 milliseconds when using an index on declination only; this was reduced to 4 milliseconds or less when using a pixel-code index (although this does not include the very small amount of time taken to convert the cone position to the small set of pixel-codes).
3.3 Conclusion on Cone Search
For sites with tables of modest size, say up to a few million rows, a simple index on declination is good enough for cone searches. For larger tables the use of a spatial index is preferable, if the DBMS in use supports one. If not, the best alternative is probably to use a simple index on declination.
The use of a mapping function to integer pixel-codes works fine for small regions, although it needs some significant software external to the DBMS to support it, but it does not scale well to larger regions, which may well be required by users.
Using any pixel-code method, however, requires the following:
- A function has to be written for each brand of DBMS to convert a celestial position (RA,DEC) to an integer pixel-code callable from the DBMS (already done in pl/pgsql, but this is not especially portable).
- The data centre manager needs to convert each table to add an integer column.
- The manager then needs to perform an UPDATE, calling the pixel-code function, to populate this column.
- Finally the manager needs to create an integer B-tree index on the column.
- A second function has to be written to convert a cone (RA,DEC,RADIUS) to a set of pixel-codes; the number of elements in the set is unpredictable. This function would be needed only in the data access layer. Suitable functions in Fortran and C have been written; one of these probably needs to be recoded in Java.
- The data access code needs to be modified to generate code to convert a REGION function in ADQL to code which first calls the latter function to generate a list of pixel-codes, and then produce SQL to search for all of these. The SQL will also use the proper great-circle distance expression to select only rows with positions actually in the required cone on the sky.
4. Cross Matching
4.1 Cross-matching Catalogues
Cross-matching means finding which source or sources in a second catalogue are associated with each one in the first. This is an important basic operation in a whole range of scientific investigations, as it allows one to combine information from two wavebands or epochs; it figures in many of the "Top Ten" science problems. In database terms it corresponds to a spatial join between two tables each with columns of (α, δ) and with position error either explicit in another column or fixed for the whole table. Although important it is supported only in a few special cases by a few sites, probably because it is difficult to do in general and on a large scale.
Ideally there will be exactly one match in the second catalogue for each source in the first, but in practice one may find from zero to many. The case of zero matches may be important: this will be listed in the output if one does an OUTER JOIN, but not in the more usual INNER JOIN. If there are N matches, then other criteria will have to be used to discover the true counterpart: this is most easily done in the resulting joined table, providing it includes all the necessary columns of information from the two input tables.
It is true that cross-matching can be thought of as just a series of cone-searches, one for each source in the first catalogue, but I think that cross-matching has to be considered separately for two reasons. Firstly a cone search is, at any one site, just a single database search and it does not matter too much if it is excuted in seconds rather than milliseconds, whereas a cross-matching operation might involve millions or even billions of rows so it needs to be executed efficiently by the DBMS. Secondly the database operations involved are rather different: a cone-search corresponds to a SELECT operating on one table at a time whereas the cross-match corresponds to a
join between two tables. Joins are always slow operations, and the presence or absence of a suitable index can mean the difference between taking seconds and taking hours.
You
can do a cross-match by a series of cone searches, but it is terribly slow, and afterwards you have to concatenate all the separate sets of results. In theory a cross-match is just a type of join, and DBMS are supposed to be good at doing joins, indeed it is an essential element of the relational database paradigm, with lots of SQL features to support it. The problem is that we want to join not on exact match, but on approximate match: the overlap of the respective error regions (circles or whatever). We also need to have the DBMS query optimiser make use of an index to get good performance, especially if we have tables of more than a very small number of rows. Herein lies the problem.
The syntax of the cross-match isn't yet an IVOA standard, but what is proposed for VOQL (and what is already working in JHU's skyservice spparently) is something like this:
SELECT * FROM radiocat AS r, xraycat AS x WHERE XMATCH(x,r) < 3 ;
I have not seen a proper definition of the XMATCH function, but a document on JVOQL suggests that this means a match within a 3 arcsecond tolerance, while I have been told (Bob Mann, private communication) this in the SDSS SkyServer it means a match within 3 sigma. Obviously the function needs to be defined unambiguously before it can be agreed by the IVOA.
If these catalogues are stored as tables in a DBMS, with a spatial index on the appropriate geometric data object (errbox) representing the error region (or a box drawn around it), then the Postgres syntax is something like this:
4.1.1 Postgres
SELECT * FROM radiocat AS r, xraycat as x WHERE r.errbox && x.errbox AND
DEGREES(7200 * ASIN(SQRT(POW(SIN(RADIANS(udec-tdec)/2),2) +
COS(RADIANS(udec)) * COS(RADIANS(tdec)) * POW(SIN(RADIANS(ura-tra)/2),2)))) < 3.0 ;
Again Postgres has a simple basic syntax, with the "&&" operator producing a boolean result if the spatial object operands have an overlap. The rest of the trig is as messy as usual, but should be familiar by now. The cross-match radius has been set at 3.0 arcseconds, and the 7200 factor includes 2 (needed by the haversine formula) and 3600 to convert degrees to arc-seconds. But this is not quite what is required, as the && operator will return true only if the error regions overlap, and these may have been defined without reference to the 3 arcsecond limit.
I think this is something that astronomers will have to sort out before the XMATCH() function can be adopted officially. A number of possibilities come to mind:
- You might want to use a fixed maximum offset (as in the 3 arcseconds above) no matter what the errors on the individual entries in the source catalogues.
- You might want to match within the combined radii of error in the source catalogues, where each source in the catalogue (row in the table) has its own error-radius.
- You might want to match within some multiple of the combined error-radii, for example if the radius is a one-sigma value, you might want to match if they overlap when extended to 3-sigma.
- You might want to match within the combined error-radii plus some extra allowance for proper motion or errors of measurement.
I am not sure how we can allow for all these in the VOQL syntax. But let's assume we can devise a suitable syntax, implementing it could be tricky. If you just want to use the error radii in the catalogues, then the R-tree has to be constructed upon an ERRBOX column populated with appropriate values. If you want to increase this by multiplying or adding a constant, then the R-trees will be no use, and it will be necessary to repopulate the ERRBOX column and re-create the R-tree index.
One possible solution is to generate the ERRBOX column with exceptionally generous error regions, as large any astronomer could conceivably wish for. Then the initial selection using the "&&" operator will select too many matches, but the subsequent filtering stage will only select the ones that matter. This could be augmented, of course, with a selection involving the actual error radii. E.g. suppose we have columns called POSERR in each catalogue (measured in arcseconds) and we want to use a radius 2.5 times the combined error radius, then the SQL would be:
SELECT * FROM radiocat AS r, xraycat AS x WHERE r.errbox && x.errbox AND
DEGREES(7200 * ASIN(SQRT(POW(SIN(RADIANS(udec-tdec)/2),2) +
COS(RADIANS(udec)) * COS(RADIANS(tdec)) * POW(SIN(RADIANS(ura-tra)/2),2)))) <
2.5 * (r.poserr + x.poserr) ;
4.1.2 MySQL and DB2
The WHERE clauses look much the same except for the first part, which is for MySQL:
SELECT * FROM radiocat AS r, xraycat AS x WHERE MBRIntersects(r.errbox,x.errbox) AND ...
and for DB2 rather similar:
SELECT * FROM radiocat AS r, xraycat AS x WHERE ST_Intersects(r.errbox,x.errbox) AND ...
4.2 Cross-match User's Table
One of the important facilities that will be desirable is to allow a user of the system to upload a catalogue and cross-match it with one or more of the standard catalogues resident at the data centre. If a spatially-aware DBMS is used, this will involve the following steps
- Uploading the user's table, presumably in VOTable or FITS format (perhaps a myspace function).
- Creating the schema for suitable temporary table on the DBMS
- Importing the user's table to the DBMS, which may require format conversion to CSV since that is the only format accepted by most DBMS.
- Creating an additional column for the error-box
- Populating the error-box column with values based on the position and error-radius information in other columns in the table
- Creating a spatial index on this column.
- Performing a spatial join with the standard table, on which spatial index has already been created.
- Downloading the results in a suitable format (under myspace control, presumably).
This sequence of steps is not simple, but there is no reason why most of it cannot be packaged and automated for ease of use.
4.3 DBMS Without Spatial Indexing
It is important to appreciate that one what needs is (a) a method of efficient spatial access, and (b) a query optimiser which is capable of using this efficient access to perform the spatial join efficiently. The latter is a more serious problem.
I have experimented on a number of join possibilities using MySQL, Postgres, and DB2 and found that (with the exception of a true spatial join using a spatial data type and the appropriate function or operator) the only form of join expression which current query optimisers is an equi-join with a pair of integer or string columns involved. One cannot even get an optimised join when using a range of integers in the comparision, such as in this example or a variety of similar ones:
SELECT * FROM xcat AS x,rcat AS r WHERE ABS(x.intcol - r.intcol) < 3 ;
Note: I am using SELECT * in all these examples where it would usually be necessary to list actual column names, purely in the interests of brevity.
My conclusion is that the only joins which are feasible are proper full-blooded spatial joins, and integer equi-joins. The latter does not help in our cross-matching situation, even given the pixel-code functions suggested for cone-searching above, because the finite area of the error-box means that there can be a set of pixel-codes on both sides of the join, and making a join on a match between any member of this set, and any member of that set, is not something that these DBMS are designed to do.
The best alternative is the pixel-code method, outlined in section 7 of
SkyIndexing. This requires the creation of an additional table of pixelcode information for each table which is involved in cross-matching; these tables will be at least as long as the data tables with which they are associated, but need only contain two columns. The generation of a suitable auxiliary table for each standard catalogue can be performed when the system is set up. The generation of a suitable table for each user table ingested, requires additional steps to those listed in the preceding section. The join is also more complex, as it requires yet another temporary table to be created, and ends with a three-way equi-join. Provided enough disc space is available, however, there is no reason why this also cannot be automated. My evaluations so far suggest that the pixel-code method is much less scaleable than a true spatial join, because of the extra complexity, and the need for sorting to remove duplicates. It may be good enough, however, for joins on tables of small or medium size.
4.4 Conclusion on Cross-matching
When using a DBMS with spatial indexing enabled, we can write a join expression in suitable SQL which is executed efficiently if a spatial index is defined on these catalogues. It turns out that Postgres works fine when a spatial index exists only on table 2, I do not know if this applies to the other DBMS, it depends on the join algorithm used. We also need to refine the XMATCH function syntax, to make it flexible enough to do what astronmers want in practice, whatever that turns out to be. It may need to specify for each table a multiplicative value P, so that if the table contains 1-sigma errors the user can search out to, say, P sigma; it may also be desirable to add a constant tolerance value, Q arcseconds, to all matches to account for systematic errors or unexpected proper motions.
The cross-match operation could be performed on a DBMS without spatial indexing using the pixel-code method, but it is complicated and slower, especially for large tables, and cannot really be recommended.
There appears to be only one good general solution to this problem: since astronomers are working all the time with spatial data, if they want to use DBMS then they ought to use a DBMS which supports spatial data types and indexing.
Appendix: Pixel-code sizes
See
PixelCodeSizes
--
ClivePage - last revised 26 Feb 2004