Report on Cross Matching Catalogues

1. Introduction

This document reports work carried out during 2003 July to September.  The main aims were:

  1. To compare the ease of use and scalability of three DBMS: DB2, MySQL, and Postgres.  We were particularly interested in their ability to handle spatial data and carry out a spatial join between two tables representing source catalogues.
  2. To compare the performance of two spatial join algorithms: (a) using a true spatial index, e.g. based on an R-tree, (b) using a pixel-code method based on a 2-d to 1-d mapping function.
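
The pixel-code method of (b) can be illustrated with a short sketch: positions are quantised onto a grid and the two cell indices are flattened into one integer, so the spatial join becomes an ordinary equi-join on the code, followed by an exact distance test. A minimal Python illustration (the 0.1-degree cell size and row-major mapping are my own choices here, not necessarily those of the tested implementation):

```python
# Map a (ra, dec) position in degrees to a single integer pixel code
# by quantising onto a fixed grid and flattening row-major.
CELL = 0.1                     # cell size in degrees (illustrative)
NRA = int(360 / CELL)          # number of cells along the RA axis

def pixel_code(ra, dec):
    """2-d to 1-d mapping: grid cell indices -> one integer."""
    ira = int(ra / CELL) % NRA         # RA cell, wrapping at 360 degrees
    idec = int((dec + 90.0) / CELL)    # Dec cell, offset so the index >= 0
    return idec * NRA + ira            # row-major flattening

# Two sources in the same 0.1-degree cell share a code, so the spatial
# join can begin as an equi-join on the code column:
assert pixel_code(123.41, 0.02) == pixel_code(123.47, 0.07)
# Sources in different cells do not:
assert pixel_code(123.41, 0.02) != pixel_code(123.56, 0.02)
```

Matches near cell boundaries require the neighbouring cells to be examined as well, which is one reason the method is more involved in practice than R-tree indexing.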

The reasons for choosing these three DBMS to evaluate were:

The main attributes of each DBMS were evaluated, and for ease of comparison, are listed in the same order in the following sections.  Section 7.4 reports on the comparison of pixel-code and R-tree indexing for the spatial join.

2. Documentation

Postgres

Postgres has four on-line manuals: tutorial, administration guide, reference manual, and programmers guide.  These can be downloaded in PDF or HTML form, or searched on-line.  There is also a nice interactive version, somewhat wiki-like, in which one's own comments or questions can be appended to the official contents; I posted a question which resulted in several interesting responses by email.  There are also two good textbooks on-line.  Many Postgres books are on sale; after browsing a number of them in Foyles, I purchased one, PostgreSQL Developers Handbook by Geschwinde and Schönig, as it was just about the only one to even mention advanced topics such as R-tree indexing.  I found this quite useful.

The on-line help command lists basic syntax, but that is all.  Software support comes from a number of newsgroups, which are active and helpful.  But because these newsgroups have not been properly registered in the Usenet hierarchy, the Janet Usenet service has refused to make them available.  Actually one group, comp.dbms.postgres.hackers, is provided, probably by accident; I had to use an alternative news server to see the others.

The reference manual is particularly good in telling you not only the exact syntax and facilities of each SQL command, but also the differences, if any, between it and standard SQL-92.  Documentation on how to use R-tree indexing was somewhat brief, and I had to do some browsing of help files and experimenting to work out which operators on spatial objects were optimised by the SQL engine.

MySQL

There is one large reference manual which can be downloaded from the mysql.com website.  I also purchased two books out of the many on sale: MySQL Reference Manual by Widenius and Axmark (the principal authors of the package), and Core MySQL by Atkinson.  These were both quite useful.  The documentation on geometric data types and R-tree indexing only appeared on-line just before I started my tests, and was somewhat sketchy, but just enough to get me started.  The on-line help facilities of the client are pretty basic.  Software support, for those using the free version, comes from groups on the mysql.com website.  When I had problems loading spatial data I asked about external file formats; my question was not answered satisfactorily, perhaps because there are as yet very few users (but nobody from mysql.com bothered to reply).

DB2

The problem with DB2 is the sheer volume of documentation, which often defeated my attempts to find a particular item of information that I was sure must be present somewhere.  There are 49 main manuals, supplied as PDF files, several with over 700 pages.  One volume contains the index, but reading an index as a PDF file is very tedious.  I found it hard to search the PDFs on-line, and printed out four of them and parts of a few others; even so these make a large pile on my desk.

I found that a user called Graeme Birchall had published on-line a book called DB2 UDB V8.1 SQL Cookbook, which contained a lot of useful information on advanced SQL techniques.  Also on-line are some useful IBM "Redbooks": I printed out DB2 UDB Evaluation Guide for Linux and Windows and Up and Running with DB2 for Linux, not much over 300 pages each.  Between them they gave me basic installation instructions, though these proved confusing and inadequate.  I could find very little in the way of introductory or tutorial material, and kept coming across phrases like "ask your administration team to install this additional product for you".  If only I had an experienced DB2 administration team on hand.

Jeff Lusted, who has some experience of using DB2, said that I should try to get hold of the HTML version of the documentation, as the cross-links make it much easier to search; unfortunately I could find manuals in HTML format only for versions of DB2 earlier than V8.  The various DB2 clients provide some on-line help, but nothing that I actually found useful.  Because these documents were so difficult to use, I decided to buy a book published by IBM called DB2 Universal Database V8 Handbook for Windows, Unix, and Linux.  This was mis-delivered by Securicor to De Montfort University (although very clearly addressed here), which further impeded progress.  When it eventually arrived, the book proved quite useful in itself, and also for its pointers on where to look in the huge manual collection.
We had earlier purchased an O'Reilly handbook called SQL in a Nutshell which is a compilation of SQL dialect variations.  It is a pity that it covers the quirks only of Oracle, Microsoft SQL Server, MySQL, and Postgres, ignoring DB2.  Perhaps this reflects the low profile of DB2 in the market.

It was not clear to me whether, as someone not paying for the product, I qualified for direct support from IBM.  There are, however, some Usenet groups covering DB2, including one specifically for the DB2 Spatial Extender.  When I asked a question, I got helpful answers from two members of  IBM staff, and one person who appeared to be just a helpful user.  The very low traffic on this group suggests either that the Spatial Extender has very few users, or that nobody has problems with it.

3. Installation, Configuration, and Tuning

Postgres

I found the RPM for Redhat Linux, downloaded it, and installed it without problems.  A few minutes' work with the manual showed me how to disable all the access restrictions and set the server to use our RAID array for all its files.  I read a number of documents on tuning Postgres, and made a few changes when I found it was using only a rather small proportion of our one gigabyte of memory, but never managed to detect any significant speed improvement.  Unless I missed something important, it looks to me as if Postgres comes reasonably well tuned.

MySQL

I found the RPM for Linux, downloaded it, and installed it without problems.  But configuring it took a lot longer: the RPM was set up to put all its database files on the /var partition which would not have enough space.  The server appeared to have a command-line option to change this, and also an environment variable. I tried both of these in various combinations without effect.  Eventually I discovered in the manual that the only way to change the disc usage setting was to download a source-code version of MySQL and compile it myself with an altered configuration file.  This seemed daunting, and I searched for help on the newsgroups.  As nearly always, the problems I encountered with all three DBMS were common ones, and very frequently asked on the newsgroups.  Why the software producers take no notice of these common problems I cannot understand.  From answers in the newsgroup I discovered that there was in fact a work-around without re-compiling: create a database, move the entire directory to the location I really wanted (this needed root and lots of file permission changes), and then insert a soft-link from the old location to the new one.  This worked.  I did not find any mention of tuning in the MySQL documentation, and did not do any.  It seems to come ready to use.
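
The work-around amounts to moving the database directory and leaving a soft-link at the old location.  A sketch of the same steps in Python (the paths are invented for illustration; in practice this needs root privileges, file permission changes, and the server stopped):

```python
import os
import shutil
import tempfile

def relocate_datadir(old, new):
    """Move a database directory to another disc and leave a soft-link
    at the old location, so the server still finds its files there."""
    shutil.move(old, new)     # move the whole directory tree
    os.symlink(new, old)      # old path now points at the new location

# Demonstration with throw-away directories standing in for the /var
# partition and the RAID array:
base = tempfile.mkdtemp()
old = os.path.join(base, "var_lib_mysql")
new = os.path.join(base, "raid", "mysql")
os.makedirs(old)
os.makedirs(os.path.join(base, "raid"))
open(os.path.join(old, "ibdata"), "w").close()   # a dummy database file

relocate_datadir(old, new)
assert os.path.islink(old)                            # soft-link left behind
assert os.path.exists(os.path.join(old, "ibdata"))    # still reachable via it
```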

DB2

Downloading DB2 RPMs requires logging in to the IBM Scholars (sic) web-site and entering one's user-name and password several times over.  Eventually you find a huge list of DB2 products to download.  Patricio Ortiz had downloaded V8 of the Enterprise Server earlier, so initially I used this.  But (as noted below) after over two weeks of struggling with the package and finding that the administration server would not work properly, I decided that it would be easier to start again from scratch, perhaps with a slightly less huge installation.   DB2 UDB comes in a variety of versions and installation kits, from Personal Workstation to Enterprise.  We had already discovered that the Spatial Extender required version 8.0 or later on Linux, but the documentation was confusing about which kit was needed.  Eventually I downloaded the Personal Workstation kit for V8.1, because I could not find the Personal Developer Kit.  This was a huge RPM, around 400 MB, and it took at least five attempts to get a complete download.

Installation was then something of a nightmare.  For a start it overwrote some files from the earlier installation which I wanted to keep.  It requires JDK v1.3, whereas we already had v1.4.1 installed, and I kept having to change environment variables to get the old version picked up.  DB2 also requires three specific user-names to be set up: db2inst1, db2fenc1, and dasusr1.  The installation wizard claimed it would create these users, but it did not: apparently on systems running NIS this has to be done by hand.  I spent several hours trying to get my own user-name granted sufficient privileges to be able to use DB2, but failed, and in the process messed up some of my own shell's settings.  The new users were set to use the Korn shell, which was unfriendly.  I eventually carried out all my tests of DB2 while logged in as db2inst1, cutting and pasting to and from my own account for logging etc.  It would be excessively tedious to report all the problems I had, no doubt some due to my own lack of experience, but I suspect it always takes a great deal longer to install DB2 than, say, Postgres.

The relatively slim IBM book on DB2 contained three whole chapters on tuning, so it is obviously an important topic.  On quickly browsing these, it appeared that transaction processing was the main area needing to be tuned.  Since I ran out of time and had no interest in transactions, I made no attempt to tune or optimise my DB2 queries.

4. General Usability

Postgres

It was easy to start and stop the server, and I could do this without logging in as root.  The client was equally easy to start and stop.  A long-running query could be cancelled just by typing ctrl/C: this caused the current query to be rolled back if it was doing an update, or otherwise stopped quickly and cleanly.  There seemed to be no ill effects from aborting queries.  The command-line interface, psql, made it easy to recall and edit commands, and multi-line commands could be entered fairly easily.  The error messages were good, telling me what was wrong and which part of the SQL statement had caused it.  There was a command \timing to toggle the timing information given when each operation concluded.  The user-interface allowed command-line recall and editing, and external SQL scripts.  Although I found Postgres to be generally satisfactory, there were a few quirks:
It is notable that the most common problem posted to all DBMS newsgroups appears to be a complaint that the DBMS is not using an index when the user expected it to be.  The EXPLAIN command in Postgres at least gives some information on how it thinks any given query is best evaluated, and some tools to influence its optimiser.

Postgres allows user-defined functions to be specified in a number of languages, including its own procedural version of SQL called PL/PGSQL, and used easily in SQL queries.

MySQL

It was necessary to log in as a privileged user to start the server, but, oddly, not to stop the server, nor to start or stop the client.  Aborting a query did not seem to be possible without crashing the client, but it was usually possible to re-start it without serious ill-effects.  The SQL implemented by MySQL is less complete and less standard than that of the other DBMS, though every release brings improvements; some sub-selects now seem to be supported.  There are various back-ends: I used the default one, which does not fully support transactions with roll-back, since that was not necessary for our requirements.   Data types were a problem: on Postgres and DB2, REAL is a 4-byte type and FLOAT an 8-byte type; with MySQL the reverse is true.  I found that the only way to get 4-byte and 8-byte floating types for the same columns in MySQL and Postgres was to use non-standard types called FLOAT4 and FLOAT8.  MySQL provides the elapsed time of each query by default.  The user-interface allowed the usual command-line editing and recall, and use of external script files.  MySQL also provides an EXPLAIN command, although the output is not quite as easy to interpret; there does not seem to be any way of influencing the query optimiser, should the explanation prove unsatisfactory.
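
The 4-byte versus 8-byte distinction matters in practice because a 4-byte float keeps only about seven significant figures, marginal for coordinates quoted to fractions of an arcsecond.  A quick check in Python, round-tripping a value through both widths with the struct module:

```python
import struct

ra = 123.45678901    # a right ascension in degrees

# Round-trip through a 4-byte float (FLOAT4 / Postgres REAL):
ra32 = struct.unpack('f', struct.pack('f', ra))[0]
# Round-trip through an 8-byte float (FLOAT8 / DOUBLE PRECISION):
ra64 = struct.unpack('d', struct.pack('d', ra))[0]

assert ra64 == ra     # 8 bytes preserve the value exactly
assert ra32 != ra     # 4 bytes lose the last few digits
# The 4-byte rounding error here is of order 1e-6 degrees, i.e. a few
# milliarcseconds -- already significant for precise astrometry.
```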

MySQL only supports user-defined functions if written in C, and a privileged account is then needed to link the compiled version with the MySQL executable.

DB2

There are three different user-interfaces, all of them with their own limitations:
Apart from grappling with these hostile user-interfaces, other problems I had with DB2 include the following:
There is no EXPLAIN command in DB2, but there is a separate product called Visual Explain.  I did not have time to try this out.

DB2 allows user-defined functions in several languages, including a procedural version of SQL.

5. Data Loading

Postgres

Loading data was simple, especially since the 2MASS data files were designed for use with Postgres.  I saw in the manual that a data load operation is performed within a single transaction, so it is all rolled back if anything goes wrong, but I saw no errors.  A typical load command is:
COPY mytable FROM '/home/cgp/db/data2/psc_baa.csv' WITH DELIMITER '|' ;
Nulls were represented by \N in the 2MASS files, the Postgres default, and I used this in the data files I prepared myself from the raw USNO-B distribution.

MySQL

Loading non-spatial data was just as easy, since MySQL seems to use the same defaults as Postgres.  The load syntax is not something that the designers of SQL have ever bothered to standardize, and MySQL requires a slightly more verbose syntax:
LOAD DATA INFILE '/home/cgp/db/data2/psc_baa.csv' INTO TABLE mytable FIELDS TERMINATED BY "|" ; 

DB2

Here the serious problems began.  I searched the two volumes of SQL commands for a long time without finding a data load command, and the Redbooks were no help.  I suppose DB2 is almost always used in a transaction-processing environment where data is captured live, and hardly ever needs to be loaded from external files.  I finally discovered a separate PDF manual called Data Movement Utilities Guide and Reference, a mere 429 pages long, which eventually yielded up the secrets of the LOAD command.  I then found that it could not be used from the db2batch environment or the DB2 GUI, only from the db2 command-line interpreter.  The typical syntax is:
LOAD FROM '/home/cgp/db/data2/psc_baa.csv' OF DEL MODIFIED BY COLDEL| INSERT INTO mytable ;
But data loads failed repeatedly.  At first I thought that it was the date/time fields which were causing problems: with 60 fields per line and hopeless error messages, it took me some time to eliminate each possibility.  Tests with samples of a few hundred or thousand rows sometimes worked fine.  Eventually it seemed that the nulls were not accepted.  The manual suggested (without being definitive on the subject) that there was no external representation of nulls, except for the absence of data between two field delimiters.  Unfortunately every such null-by-default gave rise to a warning message.  I left one data load running when I went home at night; the messages were still scrolling off my terminal screen at the rate of hundreds per second when I arrived the following morning.  I had to abort this run, which left the database in a mess, and eventually I had to reboot the machine.

Since all the nulls were in numerical columns (magnitudes), I found a work-around by using the Unix tr utility to convert each null into the value "-1", which was out of the normal range.  By this means I finally managed to load some strips of the 2MASS data.  I did not get around to converting these values back into nulls, though I suppose that an UPDATE statement would have been able to do that.  Even so, each load operation generated lots of incomprehensible messages, of which this is a typical example:
SQL3501W  The table space(s) in which the table resides will not be placed in
backup pending state since forward recovery is disabled for the database.
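
The work-around described above, substituting an out-of-band value for each empty field before loading, can be expressed as a short Python filter (the '|' delimiter and '-1' sentinel follow the text; the actual conversion was done with the Unix tr utility):

```python
def plug_nulls(line, delim='|', sentinel='-1'):
    """Replace empty (null) fields in a delimited record with an
    out-of-band sentinel value that the LOAD utility will accept."""
    fields = line.rstrip('\n').split(delim)
    return delim.join(f if f != '' else sentinel for f in fields)

# A record with nulls in two of its magnitude columns:
assert plug_nulls('0.000286|0.0772||19.78||17.34') == \
       '0.000286|0.0772|-1|19.78|-1|17.34'
```

After loading, an UPDATE ... SET col = NULL WHERE col = -1 on each affected column could restore the nulls inside the database.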
Although I could not get timing information as with other DB2 commands, since the db2batch facility would not load data, the LOAD command reported the clock times at the start and end, so the performance could be measured.  The only good point of DB2's data loading was that it was fast: I noticed that it used the CPU time of both processors.  The times taken to load 5.1 million rows of 2MASS were as follows:

Postgres   116 secs
MySQL      205 secs
DB2         44 secs

Given that nulls are an essential element of nearly all astronomical tables, and that finding a suitable out-of-band value to represent them is a non-trivial undertaking, I think that if it is true that DB2 has no satisfactory way of representing nulls, this omission forms a very serious obstacle to its continued use.  It is nice that DB2 is fast; it would be even nicer if it did the job we wanted it to do.

6. Spatial Indexing

Postgres

Loading spatial data from a text file required a slightly odd format with extra parentheses, but I had no trouble in loading the required spatial column, nor in writing a program to convert USNO-B data to the required form.  The basic spatial object is a rectangle, specified by the coordinate pair (x,y) of two opposite corners, and one can create an R-tree index on these bounding boxes.  The SELECT statement can use operators (such as &&) to test for overlap of the rectangles.  It was easy enough to read the rectangle coordinates from an external text file, or to generate the rectangles on-line using an UPDATE statement (which is obviously simpler).  There is little to report here: spatial indexing worked as expected, and was reasonably fast.
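
The overlap test performed by the && operator is simply an interval comparison on each axis: two boxes intersect unless one lies entirely to one side of the other.  A Python rendering of the predicate (the function name is mine; boxes follow the Postgres convention of two opposite corners, in either order):

```python
def boxes_overlap(a, b):
    """Overlap test equivalent to the Postgres && operator on boxes.
    Each box is ((x1, y1), (x2, y2)): two opposite corners, either order."""
    ax1, ax2 = sorted((a[0][0], a[1][0]))
    ay1, ay2 = sorted((a[0][1], a[1][1]))
    bx1, bx2 = sorted((b[0][0], b[1][0]))
    by1, by2 = sorted((b[0][1], b[1][1]))
    # Overlap unless one box lies wholly to one side of the other:
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Bounding boxes of two nearby sources overlap...
assert boxes_overlap(((-0.00027, 0.07670), (0.00084, 0.07781)),
                     ((0.00050, 0.07700), (0.00100, 0.07750)))
# ...while well-separated ones do not.
assert not boxes_overlap(((0.0, 0.0), (0.1, 0.1)), ((0.5, 0.5), (0.6, 0.6)))
```

The R-tree index serves to avoid evaluating this predicate against every row: only boxes in overlapping index pages are compared.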

MySQL

MySQL has only added spatial data support in its latest (alpha) release, and it shows many omissions which will no doubt be rectified in future releases.  The software is aimed at the GIS market, and uses R-trees for indexing.  The documentation was very sketchy, and some trial and error was needed to discover what actually worked.  The first problem was loading the data, as no external format for the spatial data type was documented.  I asked in a support newsgroup, but got no answer; I think this may be a feature still missing.  The only alternative was to use INSERT statements to load the spatial data, but loading one row per INSERT would have taken far too long.  I found by experiment that I could not load a large table in a single INSERT statement, but that 1000 rows per INSERT appeared to work.  I therefore wrote a new utility to convert the USNO and 2MASS data files into the appropriate SQL INSERT statements.  These were very verbose; inserting a single row involves this:
insert into us values
(0.000286,0.07725278,4.06E-05,NULL,19.78,20.59,19.77,17.34,
polygonfromtext('polygon(( 0.00084 0.07781, -0.00027 0.07781, -0.00027 0.07670,
 0.00084 0.07670, 0.00084 0.07781))'))

and so on, for another 3000 lines per INSERT statement.  Note that five coordinate pairs define the box: the first and last have to be identical.  To judge from the disc space used, these coordinates were all stored in the index, which was substantially larger than the R-tree of Postgres; this may explain the lower speed.  Finding an intersection or overlap was also much more verbose, as there were no operators available, only functions.  The one which I found to work was MBRContains, as illustrated by this SQL sample:
select ura,udec,astext(ubox) from us where
MBRContains(GeomFromText('Polygon((123.45 0.05, 123.46 0.05, 123.46 1,
123.45 1, 123.45 0.05))'),ubox);

This verbosity might have been acceptable had the performance been reasonable, but as noted below, the MySQL spatial index was many times slower than that of Postgres.
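
The conversion utility mentioned above is essentially string assembly: build the five-vertex WKT polygon for each bounding box and pack 1000 rows into each INSERT.  A cut-down sketch (the column list is reduced to a (ra, dec, radius) triplet for illustration; the real utility handled all the catalogue columns):

```python
def box_wkt(ra, dec, radius):
    """WKT polygon for a bounding box: five coordinate pairs, the
    first and last identical, as MySQL requires."""
    x1, x2 = ra - radius, ra + radius
    y1, y2 = dec - radius, dec + radius
    return ("polygon((%g %g, %g %g, %g %g, %g %g, %g %g))"
            % (x2, y2, x1, y2, x1, y1, x2, y1, x2, y2))

def insert_statements(rows, table='us', batch=1000):
    """Pack (ra, dec, radius) triplets into multi-row INSERT
    statements, 'batch' rows per statement."""
    stmts = []
    for i in range(0, len(rows), batch):
        values = ",\n".join(
            "(%r, %r, polygonfromtext('%s'))" % (ra, dec, box_wkt(ra, dec, r))
            for ra, dec, r in rows[i:i + batch])
        stmts.append("insert into %s values\n%s;" % (table, values))
    return stmts

stmts = insert_statements([(0.000286, 0.077253, 0.000556)] * 2500)
assert len(stmts) == 3          # 2500 rows -> batches of 1000, 1000, 500
assert "polygonfromtext('polygon((" in stmts[0]
```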

DB2

DB2 spatial indexing comes in an additional package, called the Spatial Extender.  I downloaded it with difficulty (another 300 MB of software over a flaky FTP connection) but it eventually installed correctly.  But it would not work.  I eventually discovered, via the control centre, that it was not licenced.  I asked the IBM Scholars Program administration how to obtain a licence, and after a week got an answer.  It turns out that the licence can be downloaded from their web-site (indeed I had already done this), but it needs to be copied to a new location and then a licence installation procedure carried out.  Unfortunately, by the time I had all the necessary details, there was insufficient time left to complete testing this feature.  The manual is fairly comprehensive, but at 574 pages, as intimidating as the others.

One immediate difference is that IBM's Spatial Extender does not use R-tree indexing, but a multi-level grid file.  These are not self-adjusting like B-trees and R-trees, but require the user to consider the range and granularity of up to three levels of the grid-file.  Although the data are expected to be floating-point values (geographical longitude and latitude are expected), they are scaled and then converted to 32-bit integers.  These scaling factors also have to be chosen by the user. 

Another problem then presented itself, just as for MySQL, the lack of an external format for spatial data.  I posted a question about this to the dedicated newsgroup ibm.software.db2.udb.spatial, and got useful replies, including this suggestion for populating the spatial object from the (ra,dec,radius) triplet:
db2gse.st_polygon(
> 'polygon(('||char(x-radius)||' '||char(y-radius)||','||
> char(x-radius)||' '||char(y+radius)||','||
> char(x+radius)||' '||char(y+radius)||','||
> char(x+radius)||' '||char(y-radius)||','||
> char(x-radius)||' '||char(y-radius)||'))', 0)
In other words, one has to convert all the numerical values into character strings, because the polygon function will only take string arguments.  If such horrible work-arounds are still necessary just to load spatial data, I dread to think what hoops one has to jump through to use them.  I discovered that the Spatial Extender has a circle data type, and wondered how it was implemented, and whether it might be suitable for our error-circles.  A respondent on the newsgroup, however, told me that a circle is in general represented by a polygon with 97 sides; I did not think this would provide a very efficient implementation.

I wonder if the multi-level grid-file in DB2 will survive much longer: R-trees are much more flexible, and for our purposes much more suitable.  Other vendors who started out with other spatial index structures, such as Oracle, have now switched to R-trees or variants of them.  IBM recently took over Informix, which has used R-trees for several years.

7. Performance Comparisons

7.1 Joins using R-trees with MySQL and Postgres

This section will be updated later with DB2 details, assuming I can get its spatial extender working.

The main test joined a section of 2MASS with an overlapping section of USNO-B.

The tests were carried out on the main node of hydra, which has dual-Xeon 2.2 GHz processors, and 1 GB of memory.

Elapsed times in seconds          MySQL   Postgres
Load 2MASS, 5 million rows          205        116
Load USNO-B, 2.6 million rows       153         93
Create spatial index on 2MASS       488        617
ANALYZE (needed for Postgres)         -         67
Create index on USNO-B              347        442
ANALYZE                               -         27
Join tables on 1" circle           4030        283
Total                              5223       1645

The speed advantage of MySQL at all the earlier stages was lost in carrying out the join.  Its support for spatial data and R-trees has only just been introduced, so it is perhaps not surprising that it has not yet been optimised.  On these figures, Postgres is the clear winner.
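
For completeness: the 1-arcsecond join radius implies a final exact test on each candidate pair that the index returns, since the bounding boxes only approximate the error circles.  At such small separations a flat-sky approximation with a cos(dec) correction on the RA difference is adequate; a hedged Python sketch (the function name is my own):

```python
import math

ARCSEC = 1.0 / 3600.0    # one arcsecond in degrees

def within(ra1, dec1, ra2, dec2, radius=ARCSEC):
    """Flat-sky test: are two positions (in degrees) separated by less
    than 'radius'?  Adequate for arcsecond-scale match radii."""
    dra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
    ddec = dec1 - dec2
    return dra * dra + ddec * ddec <= radius * radius

# Half an arcsecond apart: a match at a 1" radius.
assert within(123.45, 0.05, 123.45, 0.05 + 0.5 * ARCSEC)
# Two arcseconds apart: rejected.
assert not within(123.45, 0.05, 123.45, 0.05 + 2.0 * ARCSEC)
```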

7.2 Simple Query Performance of MySQL and Postgres

It also seemed worth comparing the performance of MySQL and Postgres on a more substantial chunk of data.  These tests used ten files of 2MASS data, totalling around 50 million rows.

Operation                                             MySQL   Postgres    DB2
Load data                                              1660       4030   2154
Sequential scan for H_M between 12.010 and 12.011       192        264   1327
Create B-tree on column H_M (+ANALYZE for Postgres)    2207       1265   1709
Indexed access for H_M between 12.010 and 12.011       0.15        419   0.05

This Postgres result was so odd that it was repeated: subsequent indexed selects took 3.9 seconds, and 0.7 seconds. Presumably some information needed to be stored in the cache before the query was executed efficiently.

7.3 Inner and Outer Join

The outer join (which copies unmatched rows from one table into the results) is important in astronomy, since unmatched sources are often of scientific interest. This test was only carried out using Postgres and R-trees: the inner and outer joins took about the same times, within limits of measurement.
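
The inner/outer distinction can be demonstrated with any SQL engine: an inner join drops unmatched sources, while a LEFT OUTER JOIN keeps them, with nulls in the columns of the other table.  A toy demonstration using Python's built-in sqlite3 module (the table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tmass (tid INTEGER, jmag REAL);
    CREATE TABLE matches (tid INTEGER, uid INTEGER);
    INSERT INTO tmass VALUES (1, 12.3), (2, 14.1), (3, 15.9);
    INSERT INTO matches VALUES (1, 101), (3, 103);  -- source 2 is unmatched
""")

inner = con.execute("""SELECT t.tid, m.uid FROM tmass t
                       JOIN matches m ON t.tid = m.tid
                       ORDER BY t.tid""").fetchall()
outer = con.execute("""SELECT t.tid, m.uid FROM tmass t
                       LEFT JOIN matches m ON t.tid = m.tid
                       ORDER BY t.tid""").fetchall()

assert inner == [(1, 101), (3, 103)]             # unmatched source dropped
assert outer == [(1, 101), (2, None), (3, 103)]  # unmatched source kept
```

The unmatched source emerges from the outer join with a null identifier, which is exactly the behaviour wanted for flagging scientifically interesting non-detections.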

7.4 Pixel-code versus R-tree

The pixel-code method is described in earlier documents, SkyIndexing and SpatialIndexing, and was timed with MySQL, Postgres, and DB2.  The spatial join using R-trees was timed only with Postgres, since the spatial performance of MySQL was so poor.

Elapsed times in seconds                    Pcode   Pcode      Pcode   R-tree
                                            MySQL   Postgres   DB2     Postgres
Load psc_baa.tsv to tmp                      42.3    88.4       42.4     88.7
Load psc_baa.pcode to tmpcode                 8.8    50.4       17.5      -
Load b0900.tsv to usp                         4.5    20.3        4.5     12.1
Load b0900.pcode to uspcode                   1.4     5.8        1.3      -
Update tbox                                    -       -          -      93.3
Update ubox                                    -       -          -      12.0
Create index tcode on tmpcode(tcode)        112.7    53.7       69.6      -
Analyze tmpcode                                -      5.5         -       -
Create table pjoin2 using select distinct   119.2    31.5       46.6      -
Create index tmpind on tmp(tid)              95.8    31.0       64.6      -
Analyze tmp                                    -      1.6         -       -
Create index uspind on usp(uid)              11.5     1.5       30.0      -
Analyze usp                                    -      0.2         -       -
Create table pjoin from usp,pjoin2,tmp       31.5    64.6      106.4     94.8
Total                                       427.7   354.5      382.9    300.9

Conclusion: here too Postgres is somewhat faster than MySQL, and the spatial join using R-tree indexing is faster than the pixel-code method on the same DBMS.  This is already evident with the rather small tables of this example, and the difference grows as the table size increases.  The following table compares the two methods using Postgres alone.

Number of 0.1° strips of USNO-B   Pixel-code algorithm   R-tree indexing
 1                                      88 secs                54 secs
 3                                     406 secs               161 secs
 9                                    1150 secs               553 secs
27                                    4623 secs              1964 secs

8. Summary

The spatial join is clearly performed better by an R-tree index than the pixel-code method; it has the advantages of speed, simplicity, and flexibility.

The R-tree indexing in MySQL is still in its infancy, and not a serious competitor to Postgres. 

DB2 may have the required features in theory, but the practice leaves a lot to be desired.  Its roots in the mainframe world, where a huge team of experienced administrators would be on hand, are still very evident.  It does not look suitable for use by an astronomer in a typical data archive environment.  The absence of a usable representation for nulls in external text files is a serious drawback.  The spatial indexing looks complicated and hard to use, compared to the R-tree indexing of Postgres (or even Informix).  Despite its potential advantages, I cannot recommend DB2 at present.

Postgres is not without faults, but it has a reasonable all-round performance and all the essential features we need.

cgp 2003 Sept 29