mysql> DESCRIBE usno; +-------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+---------+------+-----+---------+-------+ | uid | int(11) | YES | | NULL | | | ura | double | YES | | NULL | | | udec | double | YES | | NULL | | | rmag | float | YES | | NULL | | | bmag | float | YES | | NULL | | +-------+---------+------+-----+---------+-------+ mysql> SELECT COUNT(*) FROM usno; +----------+ | count(*) | +----------+ | 3000000 | +----------+This is a sample of the catalog exactly as issued by US Naval Observatory. The UID column is a unique identifier, and is important for what follows. What we also need is a separate table of positional information: this just needs to contain the UID and the PCODE value or values. Where the error circle around the source position covers more than one pixel (or cell) in the chosen sky pixellation (here HEALPix), there are correspondingly more rows in this table, which I have called USNOP. It looks like this:
mysql> DESCRIBE usnop; +--------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------+---------+------+-----+---------+-------+ | uid | int(11) | YES | | NULL | | | upcode | int(11) | YES | | NULL | | +--------+---------+------+-----+---------+-------+ mysql> SELECT COUNT(*) FROM select count(*) from usnop; +----------+ | count(*) | +----------+ | 3476938 | +----------+The larger number of rows reflects the number of sources covering more than one pixel. Next we need to create two one-dimensional B-tree indices: one on the UID column of USNO, and one on the UPCODE column of USNOP:
mysql> CREATE INDEX upind ON usnop(upcode); Query OK, 3476938 rows affected (1 min 12.19 sec) mysql> CREATE INDEX uuid ON usno(uid); Query OK, 3000000 rows affected (1 min 0.56 sec)
mysql> CREATE TABLE gsc(gid INT, gra DOUBLE, gdec DOUBLE,,
-> pos_err FLOAT, mag FLOAT, mag_err FLOAT, mag_band TINYINT,
-> class TINYINT, plate_id CHAR(4), multiple TINYINT);
Query OK, 0 rows affected (0.07 sec)
mysql> LOAD DATA LOCAL INFILE 'home/cgp/data/gsc2.csv' INTO TABLE gsc
fields terminated by ",";
Query OK, 265149 rows affected (1.43 sec)
Records: 265149 Deleted: 0 Skipped: 0 Warnings: 2651490
mysql> DESCRIBE gsc;
+----------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------+------+-----+---------+-------+
| gid | int(11) | YES | | NULL | |
| gra | double | YES | | NULL | |
| gdec | double | YES | | NULL | |
| pos_err | float | YES | | NULL | |
| mag | float | YES | | NULL | |
| mag_err | float | YES | | NULL | |
| mag_band | tinyint(4) | YES | | NULL | |
| class | tinyint(4) | YES | | NULL | |
| plate_id | char(4) | YES | | NULL | |
| multiple | tinyint(4) | YES | | NULL | |
+----------+------------+------+-----+---------+-------+
10 rows in set (0.00 sec)
Next we create the second table of PCODE data, and import its data.
mysql> CREATE TABLE gscp(gid INT, gpcode INT); Query OK, 0 rows affected (0.00 sec) mysql> LOAD DATA LOCAL INFILE '/home/cgp/data/gsc2.pcode' INTO TABLE gscp fields terminated by ","; Query OK, 295358 rows affected (0.67 sec) Records: 295358 Deleted: 0 Skipped: 0 Warnings: 590716 mysql> DESCRIBE gscp; +--------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------+---------+------+-----+---------+-------+ | gid | int(11) | YES | | NULL | | | gpcode | int(11) | YES | | NULL | | +--------+---------+------+-----+---------+-------+
mysql> CREATE TABLE pjoin SELECT gid,uid FROM gscp,usnop
WHERE gpcode=upcode;
Query OK, 380984 rows affected (23.22 sec)
The results look like this:
mysql> select * from pjoin limit 10; +-----------+-------+ | gid | uid | +-----------+-------+ | 225900001 | 51938 | | 225900001 | 51938 | | 225900001 | 51938 | | 225900001 | 52142 | | 225900002 | 12526 | | 225900002 | 12526 | | 225900003 | 51938 | | 225900003 | 52142 | | 225900003 | 51938 | | 225900003 | 52142 | +-----------+-------+Obviously there are lots of duplicates, where two sources both overlap the same two (or more) pixels. We can elimiate this duplication simply by adding the DISTINCT qualifier in the SQL:
mysql> CREATE TABLE pjoin SELECT DISTINCT gid,uid FROM gscp,usnop
WHERE gpcode=upcode;
Query OK, 299132 rows affected (17.66 sec)
Note that the SQL includes the DISTINCT qualifier to avoid the same pair of rows being included more than once in the output, which implies that the results are sorted to remove duplicates.
This join is likely to be performed by a sequential scan of the smaller table, GSCP, and an indexed look-up of the larger table, USNOP. It will scale linearly with the size of GSCP, and with some logarithm of the size of USNOP. If the B-tree has an average fan-out factor of, say, 100, then the scaling law is likely to go with something like log100Nrows, where there are Nrows in USNO. The sorting stage is likely to scale as log2Nout, where there are Nout rows in the output table.
Here is a sample of the first ten rows of PJOIN with duplicate rows removed:
mysql> SELECT * FROM pjoin LIMIT10; +-----------+-------+ | gid | uid | +-----------+-------+ | 225900001 | 51938 | | 225900001 | 52142 | | 225900002 | 12526 | | 225900003 | 51938 | | 225900003 | 52142 | | 225900004 | 52827 | | 225900005 | 59385 | | 225900006 | 26593 | | 225900006 | 26603 | | 225900006 | 26638 | +-----------+-------+
mysql> CREATE INDEX index pgid ON pjoin(gid); Query OK, 299132 rows affected (5.73 sec)
CREATE TABLE temp6 SELECT gsc.gid,gra,gdec,mag,mag_err,mag_band,class,plate_id,multiple, usno.uid,ura,udec,rmag,bmag FROM gsc,pjoin,usno WHERE gsc.gid=pjoin.gid AND pjoin.uid=usno.uid AND 2 * ASIN(SQRT( POW(SIN((udec-gdec)/2),2) + COS(udec) * COS(gdec) * POW(SIN((ura - gra)/2),2) )) <= 5e-6+pos_err ; Query OK, 358476 rows affected (42.22 sec)The stuff after the word
AND of course is the selection on great-circle distance. In practice we should be able to wrap this in a custom-function so it won't look so messy.
The total time taken for these three operations is 65.66 seconds, but the components scale in slightly different ways with the table size. This seems reasonable performance for a table of over a quarter of a million rows.
Notes: htmID,objID1,objID2,distance,type1,type2,Primary1,Primary2where the objIDs, the types (denoting star, galaxy, etc), the Primary (best of possible duplicates in survey overlap regions) entries come from the PhotoObj table. The finding of pairs of neighbouring objects for this table is performed using the HTM library. Given the position of an object in (copy 1 of) PhotoObj and a radius defining neighbours, it is simple to determine which HTM triangles fall within the search radius and work out which objects in (copy 2 of) PhotoObj lie within those HTM triangles. Once they have been identified, the distance entry in the "Neighbors" table is computed using the great circle distance formula. In the new SSS database, we have simply extended this notion to computing a "CrossNeighbours" table, recording SSS objects which are neighbours of each SDSS object. This means that cross-matches between the SSS and SDSS can be performed using queries like the following:
select [whatever] from supercos as sss, skyserver as sdss, crossneighbours as crossn where crossn.sdssid = sdss.objid and crossn.sssid = sss.objid and crossn.distancemins < 1.0/60.0where here we just match objects within 1 arcsec - clearly this could be replaced by a constraint on crossn.distancemins based on the error circles of the two objects. Clearly, this approach just shifts the computational burden from query computation time to the pre-computation of the neighbours tables. The once-only time cost of creating the neighbours table does take time when the tables are large, but the increased speed of query execution is quite dramatic - at least, anecdotally, as found during work on the SSS SQL Server database. WFAU will be undertaking further tests of this approach with the ~1TB SSS database, but it seems well suited to the case of large sky survey databases that are updated infrequently. For example, the Science Requirements Document for the WFCAM Science Archive (see http://www.roe.ac.uk/~nch/wfcam) obliges WFAU to maintain local copies of several large datasets (public SDSS, 2MASS, USNO-B, etc) that can be readily cross-queried with UKIDSS data held in the WSA. In such a case, where the external databases can be regarded as static, and the WFCAM database will be expanding via relatively infrequently data releases (which will generally expand sky coverage, more than update existing data), the precomputation of cross-neighbours tables, while time-consuming, can be performed without undue impact on the "live" database, as it can be undertaken in periods of low usage (e.g. overnight) leading up to a data release. More detailed investigations of this are being undertaken as part of the continuing development of the WSA. Of course, this does not solve the generic VO problem of cross-matching arbitrary catalogues on-the-fly, but it is only when this involves (one of a limited number of) very large databases that this is a significant computational problem, and it seems likely that, in practice, copies of the largest catalogues will be held together at some sites - whether in VO data centres specifically set up for the purpose of catalogue cross-matching, or, as in the case of the WSA, because science requirements on the very largest databases require them. -- BobMann - 14 Jan 2003
![]() |
Click here for the AstroGrid Service Web |
This is the AstroGrid Development Wiki |
|