r7 - 05 Aug 2003 - 11:15:49 - ClivePageYou are here: TWiki >  Astrogrid Web  >  DocStore > DataDocs > SpatialIndexing

Spatial Joins and Spatial Indexing Revisited

(Revised 2003 February 17)

The PCODE algorithm which I outlined in an earlier Wiki paper, SkyIndexing, (and in a short ADASS presentation) has some significant disadvantages. Perhaps the most obvious one is that it needs additional rows to be inserted in each astronomical catalogue which takes part in the fuzzy join. These extra rows might be regarded as contaminating what should be a read-only resource, and they also make it harder to use the catalogue for other purposes, as the duplicates have to be detected and ignored.

Now that I have thought about the problem a bit more, I'd like to propose a modification which does not have this disadvantage. It is not quite as simple, so I shall illustrate it with an actual example, using MySQL. I have two catalogues to join:

  • USNO - a sample of 3,000,000 rows from the USNO-A2 Catalog (covering about 1% of the sky).
  • GSC - a sample of roughly the same area of the Hubble Guide Star Catalog.

The scalability of the algorithm is obviously of interest: here I shall assume that USNO is a well-known catalogue already installed on our Data Warehouse (or wherever): so that the overheads of database ingestion and index creation only have to be done once. The GSC represents some smaller catalogue, submitted by a user, so that the time taken to import it, and generate its indices is of some interest. The timings reported here use MySQL 4.0.9-gamma, running on a 2.2 GHz Intel Xeon processor and Redhat Linux 7.2.

Step 0: Set up USNO

The USNO table will be assumed to be already imported to our DBMS, so the operations in step 0 only need to be done once however many times USNO is joined with other catalogues.

The USNO table looks like this (the DESCRIBE command is not standard SQL but convenient here):

mysql> DESCRIBE usno;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| uid   | int(11) | YES  |     | NULL    |       |
| ura   | double  | YES  |     | NULL    |       |
| udec  | double  | YES  |     | NULL    |       |
| rmag  | float   | YES  |     | NULL    |       |
| bmag  | float   | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+

mysql> SELECT COUNT(*) FROM usno;
+----------+
| count(*) |
+----------+
|  3000000 |
+----------+
This is a sample of the catalog exactly as issued by US Naval Observatory. The UID column is a unique identifier, and is important for what follows. What we also need is a separate table of positional information: this just needs to contain the UID and the PCODE value or values. Where the error circle around the source position covers more than one pixel (or cell) in the chosen sky pixellation (here HEALPix), there are correspondingly more rows in this table, which I have called USNOP. It looks like this:
mysql> DESCRIBE usnop;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| uid    | int(11) | YES  |     | NULL    |       |
| upcode | int(11) | YES  |     | NULL    |       |
+--------+---------+------+-----+---------+-------+

mysql> SELECT COUNT(*) FROM select count(*) from usnop;
+----------+
| count(*) |
+----------+
|  3476938 |
+----------+

The larger number of rows reflects the number of sources covering more than one pixel.

Next we need to create two one-dimensional B-tree indices: one on the UID column of USNO, and one on the UPCODE column of USNOP:

mysql> CREATE INDEX upind ON usnop(upcode);
Query OK, 3476938 rows affected (1 min 12.19 sec)

mysql> CREATE INDEX uuid ON usno(uid);
Query OK, 3000000 rows affected (1 min 0.56 sec)

Step 1: Import GSC tables

The first step in joining GSC with USNO is to define the structure of the table and import the data from an ASCII (comma-separated value) file.


mysql> CREATE TABLE gsc(gid INT, gra DOUBLE, gdec DOUBLE,,
    -> pos_err FLOAT, mag FLOAT, mag_err FLOAT, mag_band TINYINT,
    -> class TINYINT, plate_id CHAR(4), multiple TINYINT);
Query OK, 0 rows affected (0.07 sec)

mysql> LOAD DATA LOCAL INFILE 'home/cgp/data/gsc2.csv' INTO TABLE gsc 
fields terminated by ",";
Query OK, 265149 rows affected (1.43 sec)
Records: 265149  Deleted: 0  Skipped: 0  Warnings: 2651490

mysql> DESCRIBE gsc;
+----------+------------+------+-----+---------+-------+
| Field    | Type       | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| gid      | int(11)    | YES  |     | NULL    |       |
| gra      | double     | YES  |     | NULL    |       |
| gdec     | double     | YES  |     | NULL    |       |
| pos_err  | float      | YES  |     | NULL    |       |
| mag      | float      | YES  |     | NULL    |       |
| mag_err  | float      | YES  |     | NULL    |       |
| mag_band | tinyint(4) | YES  |     | NULL    |       |
| class    | tinyint(4) | YES  |     | NULL    |       |
| plate_id | char(4)    | YES  |     | NULL    |       |
| multiple | tinyint(4) | YES  |     | NULL    |       |
+----------+------------+------+-----+---------+-------+
10 rows in set (0.00 sec)

Next we create the second table of PCODE data, and import its data.

mysql> CREATE TABLE gscp(gid INT, gpcode INT);
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA LOCAL INFILE  '/home/cgp/data/gsc2.pcode' INTO TABLE gscp 
fields terminated by ",";
Query OK, 295358 rows affected (0.67 sec)
Records: 295358  Deleted: 0  Skipped: 0  Warnings: 590716

mysql> DESCRIBE gscp;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| gid    | int(11) | YES  |     | NULL    |       |
| gpcode | int(11) | YES  |     | NULL    |       |
+--------+---------+------+-----+---------+-------+

Step 2: Create PJOIN table

Because of the index already created on the PCODE column of USNOP, we can immediately join tables GSCP and USNOP on the condition that the PCODE values are exactly equal. This produces a table, which I have called PJOIN, containing all cells (pixels) which are common to the two tables. These pixels will refer to sources which are fairly near to each other, being in the same HEALPix pixel, but they may not be so close that their error-circles overlap. We can sort this out later.

mysql> CREATE TABLE pjoin SELECT gid,uid FROM gscp,usnop 
       WHERE gpcode=upcode;
Query OK, 380984 rows affected (23.22 sec)

The results look like this:
mysql> select * from pjoin limit 10;

+-----------+-------+
| gid       | uid   |
+-----------+-------+
| 225900001 | 51938 |
| 225900001 | 51938 |
| 225900001 | 51938 |
| 225900001 | 52142 |
| 225900002 | 12526 |
| 225900002 | 12526 |
| 225900003 | 51938 |
| 225900003 | 52142 |
| 225900003 | 51938 |
| 225900003 | 52142 |
+-----------+-------+

Obviously there are lots of duplicates, where two sources both overlap the same two (or more) pixels. We can elimiate this duplication simply by adding the DISTINCT qualifier in the SQL:

mysql> CREATE TABLE pjoin SELECT DISTINCT gid,uid FROM gscp,usnop 
       WHERE gpcode=upcode;
Query OK, 299132 rows affected (17.66 sec)

Note that the SQL includes the DISTINCT qualifier to avoid the same pair of rows being included more than once in the output, which implies that the results are sorted to remove duplicates. This join is likely to be performed by a sequential scan of the smaller table, GSCP, and an indexed look-up of the larger table, USNOP. It will scale linearly with the size of GSCP, and with some logarithm of the size of USNOP. If the B-tree has an average fan-out factor of, say, 100, then the scaling law is likely to go with something like log100Nrows, where there are Nrows in USNO. The sorting stage is likely to scale as log2Nout, where there are Nout rows in the output table.

Here is a sample of the first ten rows of PJOIN with duplicate rows removed:


mysql> SELECT * FROM pjoin LIMIT10;
+-----------+-------+
| gid       | uid   |
+-----------+-------+
| 225900001 | 51938 |
| 225900001 | 52142 |
| 225900002 | 12526 |
| 225900003 | 51938 |
| 225900003 | 52142 |
| 225900004 | 52827 |
| 225900005 | 59385 |
| 225900006 | 26593 |
| 225900006 | 26603 |
| 225900006 | 26638 |
+-----------+-------+

Step 3: Create index for GSC.

The resulting PJOIN table contains all the information about which sources in each catalogue are near each other, but of course just the ID numbers. Before we can flesh this out with the details, and find just those sources with overlapping error-circles, we need one further index:

mysql> CREATE INDEX index pgid ON pjoin(gid);
Query OK, 299132 rows affected (5.73 sec)

Step 4: Three-way Join.

The final step is involves a 3-way join, and can be done like this:

CREATE TABLE temp6
SELECT  gsc.gid,gra,gdec,mag,mag_err,mag_band,class,plate_id,multiple,
usno.uid,ura,udec,rmag,bmag
FROM gsc,pjoin,usno 
WHERE gsc.gid=pjoin.gid AND pjoin.uid=usno.uid AND
2 * ASIN(SQRT( POW(SIN((udec-gdec)/2),2) +
COS(udec) * COS(gdec) * POW(SIN((ura - gra)/2),2) )) <= 5e-6+pos_err ;

Query OK, 358476 rows affected (42.22 sec)

The stuff after the word AND of course is the selection on great-circle distance. In practice we should be able to wrap this in a custom-function so it won't look so messy.

The total time taken for these three operations is 65.66 seconds, but the components scale in slightly different ways with the table size. This seems reasonable performance for a table of over a quarter of a million rows.

Notes:

  1. You might want the actual inter-source distance as a column in the resulting table: I leave this as an exercise for the reader.
  2. The POW(x,2) function is used because most versions of SQL do not have an exponentiation operator; even the POW function is not standard SQL (hardly anything useful is) - so needs to be altered for some DBMS.

Creating the PCODE tables

The bit I glossed over was the creation of the tables with just ID and PCODE for each source. At present I have not worked out how to do this within the confines of a DBMS, because in practice they only speak SQL and it is a very limited language. The problem is that the algorithms which generate a PCODE, whether HTM or HEALPix or something else, are rather complicated. The HEALPix algorithm can be translated fairly easily into SQL, but the HTM algorithm is recursive, and I wouldn't even attempt it.

Then one needs to find out all the pixels which might overlap with the error-circle: this takes a bit more effort. I wrote my own code when I found that the one provided in the HEALPix library didn't quite do what I wanted. The problem is that the result of the function is not a simple scalar but a list of pixel numbers, usually a list of one, but sometimes two or four or perhaps even more pixels. So the function will not be easy to encode into a SQL-callable form.

The easiest work-around is to create the POSn tables by running a separate program on the CATn table; generating the results in plain ASCII form, then creating the table in the DBMS from the ASCII version. But it would be nice to find a way of doing this in pure native

This leaves the problem of how to use the PCODE tables for a simple table look-up, for example for the cone-search. Here we want the user to specify a triplet of values: RA, DEC, and RADIUS. The system needs to translate this to the PCODE value or values corresponding to the pixels in the circle, and then look up the appropriate rows in the table, first by doing an indexed lookup of the PCODE in the (UID, UPCODE) table, then having the UID looking up the corresponding details in the USNO table. If there were never more than one PCODE value it would be possible to write (in C or Java) a function callable from SQL. But sometimes there may be two or four, or even more pixels in the error-circle. None of the DBMS I have looked at so far seems to allow the return of a vector of values from an SQL-callable function. The only solution that I can see is that the cone-search function will have to be coded partly in a traditional programming language (such as Java) and then use JDBC to extract the information from the DBMS.


Comments from BobMann

This is similar to what we are doing in our new SuperCOSMOS Sky Survey (SSS) database, which, in turn, is just a simple extension of something that was done in the SDSS SkyServer. In SkyServer, spatial matching is aided by a "Neighbors" table, which records all objects in the SkyServer PhotoObj table (satisfying some quality threshold) that are within some distance of another such object. For each such pair of objects, the following information is stored:

   htmID,objID1,objID2,distance,type1,type2,Primary1,Primary2
where the objIDs, the types (denoting star, galaxy, etc), the Primary (best of possible duplicates in survey overlap regions) entries come from the PhotoObj table. The finding of pairs of neighbouring objects for this table is performed using the HTM library. Given the position of an object in (copy 1 of) PhotoObj and a radius defining neighbours, it is simple to determine which HTM triangles fall within the search radius and work out which objects in (copy 2 of) PhotoObj lie within those HTM triangles. Once they have been identified, the distance entry in the "Neighbors" table is computed using the great circle distance formula.

In the new SSS database, we have simply extended this notion to computing a "CrossNeighbours" table, recording SSS objects which are neighbours of each SDSS object. This means that cross-matches between the SSS and SDSS can be performed using queries like the following:

   
select [whatever]
from supercos as sss, skyserver as sdss, crossneighbours as crossn
where crossn.sdssid = sdss.objid and crossn.sssid  = sss.objid and 
crossn.distancemins < 1.0/60.0
where here we just match objects within 1 arcsec - clearly this could be replaced by a constraint on crossn.distancemins based on the error circles of the two objects.

Clearly, this approach just shifts the computational burden from query computation time to the pre-computation of the neighbours tables. The once-only time cost of creating the neighbours table does take time when the tables are large, but the increased speed of query execution is quite dramatic - at least, anecdotally, as found during work on the SSS SQL Server database.

WFAU will be undertaking further tests of this approach with the ~1TB SSS database, but it seems well suited to the case of large sky survey databases that are updated infrequently. For example, the Science Requirements Document for the WFCAM Science Archive (see http://www.roe.ac.uk/~nch/wfcam) obliges WFAU to maintain local copies of several large datasets (public SDSS, 2MASS, USNO-B, etc) that can be readily cross-queried with UKIDSS data held in the WSA. In such a case, where the external databases can be regarded as static, and the WFCAM database will be expanding via relatively infrequently data releases (which will generally expand sky coverage, more than update existing data), the precomputation of cross-neighbours tables, while time-consuming, can be performed without undue impact on the "live" database, as it can be undertaken in periods of low usage (e.g. overnight) leading up to a data release.

More detailed investigations of this are being undertaken as part of the continuing development of the WSA. Of course, this does not solve the generic VO problem of cross-matching arbitrary catalogues on-the-fly, but it is only when this involves (one of a limited number of) very large databases that this is a significant computational problem, and it seems likely that, in practice, copies of the largest catalogues will be held together at some sites - whether in VO data centres specifically set up for the purpose of catalogue cross-matching, or, as in the case of the WSA, because science requirements on the very largest databases require them.

-- BobMann - 14 Jan 2003

Edit | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r7 < r6 < r5 < r4 < r3 | More topic actions
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback