Report on Cross Matching Catalogues
1. Introduction
This document reports work carried out during 2003 July to
September. The main aims were:
- To compare the ease of use and scalability of three DBMS: DB2,
MySQL, and Postgres. We were particularly interested in their
ability to handle spatial data and carry out a spatial join between two
tables representing source catalogues.
- To compare the performance of two spatial join algorithms: (a)
using a true spatial index e.g. based on an R-tree, (b) pixel-code
method using a 2-d to 1-d mapping function.
The reasons for choosing these three DBMS to evaluate were:
- Postgres is a full-featured open source DBMS which has for some
time supported spatial data types and R-tree indexing.
- MySQL, another open-source DBMS, has a reputation for fast
execution of simple queries. The latest alpha release has limited
support for spatial data types with R-tree indexing.
- DB2 is a major commercial DBMS which is free for academic use
under the IBM Scholars programme; it has a spatial indexing option
(Spatial Extender) using multi-level grid-file indexing, and DB2 also
supports advanced features such as parallel hardware and distributed
databases.
The main attributes of each DBMS were evaluated, and for ease of
comparison, are listed in the same order in the following
sections. Section 7.4 reports on the comparison of pixel-code and
R-tree indexing for the spatial join.
2. Documentation
Postgres
Postgres has four on-line manuals: tutorial, administration guide,
reference manual, and programmers guide. These can be downloaded
in PDF or HTML forms, or searched on-line. There is also a nice
interactive version, somewhat wiki-like, in which one's own comments or
questions can be appended to the official contents. I posted a
question which resulted in several interesting responses by email.
There are also two good textbooks on-line. Many Postgres books are
also on sale; after browsing a number of them in Foyles, I purchased
one: PostgreSQL Developers Handbook
by Geschwinde and Schönig, as it was just about the only one to
even mention advanced topics such as R-tree indexing. I found this
quite useful. The on-line help command lists basic syntax, but that is
all. Software support comes from a number of newsgroups, which are
active and helpful. But because these newsgroups have not been properly
registered in the Usenet hierarchy the Janet Usenet service has refused to
make them available. Actually one group,
comp.dbms.postgres.hackers, is provided, probably by accident. I
had to use an alternative news server to see the others. The reference
manual is particularly good in telling you not only the exact syntax and
facilities of each SQL command, but the differences, if any, between it
and Standard SQL-92. Documentation on how to use R-tree
indexing was somewhat brief, and I had to do some browsing of help
files and experimenting to work out which operators on spatial objects
were optimised by the SQL engine.
Mysql
There is one large reference manual which can be downloaded from the
website of mysql.com. I also purchased two books out of the many
on sale: MySQL Reference Manual
by Widenius and Axmark (the principal authors of the package), and Core MySQL by Atkinson. These
were both quite useful. The documentation on geometric data types
and R-tree indexing only appeared on line just before I started my
tests, and was somewhat sketchy, but just enough to get me started. The
on-line help facilities of the client are pretty basic. Software
support, for those using the free version, comes from groups on the
sql.com website. When I had problems loading spatial data I asked
about external file formats; my question was not answered
satisfactorily, perhaps because there are as yet very few users (but
nobody from mysql.com bothered to reply).
DB2
The problem with DB2 is the sheer volume of documentation, which often
defeated my attempt to find a particular item of information that I was
sure must be present somewhere. One volume contains the index, but
reading an index as a PDF file is very tedious. There are 49 main
manuals, supplied as PDF files, several with over 700 pages. I
found it hard to search the PDFs on line, and printed out four of them
and parts of a few others. Even so these make a large pile on my
desk. I found that a user called Graeme Birchall had published
on-line a book called DB2 UDB V8.1
SQL Cookbook, which contained a lot of useful information on
advanced SQL techniques. Also on-line are some useful IBM
"Redbooks": I printed out DB2
UDB Evaluation Guide for Linux and Windows and Up and Running with DB2 for Linux,
not much over 300 pages each. Between them they gave me basic
installation instructions, though these proved confusing and
inadequate. I could find very little in the way of introductory or
tutorial material. I kept coming across phrases like "ask your
administration team to install this additional product for you".
If only I had an experienced DB2 administration team on hand. Jeff
Lusted, who has some experience of using DB2, said that I should try to
get hold of the HTML version of the documentation, as the cross-links
make it much easier to search. I could find manuals in HTML format
only for versions of DB2 earlier than V8. The various DB2 clients
provide some on-line help, but nothing that I actually found
useful. Because these documents were so difficult to use, I
decided to buy a book published by IBM called DB2 Universal Database V8 Handbook for
Windows, Unix, and Linux. This was mis-delivered by
Securicor to De Montfort University (although very clearly addressed
here) which further impeded progress. When it eventually arrived,
the book was quite useful in itself, but also for its pointers on where
to look in the huge manual collection. We had earlier purchased an
O'Reilly handbook called SQL in a
Nutshell which is a compilation of SQL dialect variations.
It is a pity that it covers the quirks only of Oracle, Microsoft SQL
Server, MySQL, and Postgres, ignoring DB2. Perhaps this reflects
the low profile of DB2 in the market.
It was not clear to me whether, as someone not paying for the product,
I qualified for direct support from IBM. There are, however, some
Usenet groups covering DB2, including one specifically for the DB2
Spatial Extender. When I asked a question, I got helpful answers
from two members of IBM staff, and one person who appeared to be
just a helpful user. The very low traffic on this group suggests
either that the Spatial Extender has very few users, or that nobody has
problems with it.
3. Installation, Configuration, and Tuning
Postgres
I found the RPM for Redhat Linux, downloaded it, and installed it
without problems. A few minutes work with the manual showed me how
to disable all the access restrictions and set the server to use our
RAID array for all its files. I read a number of documents
on tuning Postgres, and made a few changes when I found it was using
only a rather small proportion of our one gigabyte of memory, but never
managed to detect any significant speed improvement. Unless I
missed something important, it looks to me as if Postgres comes
reasonably well tuned.
MySQL
I found the RPM for Linux, downloaded it, and installed it without
problems. But configuring it took a lot longer: the RPM was set up
to put all its database files on the /var partition which would not have
enough space. The server appeared to have a command-line option to
change this, and also an environment variable. I tried both of these in
various combinations without effect. Eventually I discovered in
the manual that the only way to change the disc usage setting was to
download a source-code version of MySQL and compile it myself with an
altered configuration file. This seemed daunting, and I searched
for help on the newsgroups. As nearly always, the problems I
encountered with all three DBMS were common ones, and very frequently
asked on the newsgroups. Why the software producers take no notice
of these common problems I cannot understand. From answers in the
newsgroup I discovered that there was in fact a work-around without
re-compiling: create a database, move the entire directory to the
location I really wanted (this needed root and lots of file permission
changes), and then insert a soft-link from the old location to the new
one. This worked. I did not find any mention of tuning in
the MySQL documentation, and did not do any. It seems to come
ready to use.
DB2
Downloading DB2 RPMs requires logging in to the IBM Scholars (sic)
web-site and entering one's user-name and password several times
over. Eventually you find a huge list of DB2 products to
download. Patricio Ortiz had downloaded V8 of the Enterprise
Server earlier, so initially I used this. But (as noted below)
after over two weeks of struggling with the package and finding that the
administration server would not work properly, I decided that it would
be easier to start again from scratch perhaps with a slightly less huge
installation. DB2 UDB comes in a variety of versions and
installation kits, from Personal Workstation kit, to Enterprise.
We had already discovered that the Spatial Extender required version
8.0 or later on Linux, but the documentation was confusing about
whether kit was needed. Eventually I downloaded the Personal
Workstation kit for V8.1 because I could not find the Personal
Developer Kit. This was a huge RPM, around 400 MB, and it took at
least five attempts to get a complete download.
Installation was then something of a nightmare. For a start it
overwrote some files from the earlier installation which I wanted to
keep. It requires JDK v1.3 whereas we already had v1.4.1
installed, and I kept having to change environment variables to get the
old version picked up. DB2 also requires three specific user-names
to be set up: db2inst1, db2fenc1, and dasusr1. The installation
wizard claimed it would create these users, but it did not: apparently
on systems running NIS this has to be done by hand. I spend
several hours trying to get my own user-name granted sufficient
privileges to be able to use DB2, but failed, but in the process messed
up some of my own shell's settings. The new users were set to use
Korn shell, which was unfriendly. I eventually carried out all my
tests of DB2 while logged in as db2inst1, and did cuts/pastes from/to my
own account for logging etc. It would be excessively tedious to
report all the problems I had, no doubt some due to my own lack of
experience, but I suspect it always takes a great deal longer to install
DB2 than, say, Postgres.
The relatively slim IBM book on DB2 contained three whole chapters on
tuning, so it is obviously an important topic. On quickly browsing
these, it appeared that transaction processing was the main area needing
to be tuned. Since I ran out of time and had no interest in
transactions, I made no attempt to tune or optimise my DB2 queries.
4. General Usability
Postgres
It was easy to start and stop the server, and I could do this without
logging in as root. The client was equally easy to start and stop.
A long-running query could be cancelled just by typing ctrl/C -
this caused the current query to be rolled back if it was doing an
update, or otherwise stop quickly and cleanly. There seemed to be
no ill effects from aborting queries. The command-line interface,
psql, made it easy to recall and edit commands. Multi-line
commands could be entered fairly easily. The error messages were
good, telling me what was wrong and which part of the SQL statement had
caused it. There was a command \timing to toggle timing
information given when each operation concluded. The
user-interface allowed command-line recall and editing, and external SQL
scripts. Although I found Postgres to be generally satisfactory,
there were a few quirks:
- After creating an index it is necessary to perform an ANALYZE operation, otherwise the
index is not used in subsequent queries. In dynamic environments
the operation has to be repeated regularly, but for a static table once
is enough. Index creation takes a bit longer because of this, the
times were included in those reported here.
- Floating-point constants are taken as double-precision by default
with no automatic casting to other floating-point types. This
causes problems when using an index on a real column. Thus the
syntax SELECT * FROM table WHERE
vmag BETWEEN 12.5 AND 12.6 ; would not use an index on the
(real) column vmag, it
needed to be re-phrased thus: SELECT
* FROM table WHERE vmag BETWEEN 12.5::FLOAT4 AND 12.6::FLOAT4 ;
- In a few cases I found
that queries took longer when an index was available than when it was
absent, or when two indices were provided than when only one was
provided. It is possible to influence the query plan, but I did
not have time to explore this fully.
It is notable that the most common problem posted to all DBMS
newsgroups appears to be a complaint that the DBMS is not using an index
when the user expected it to be. The EXPLAIN command in Postgres
at least gives some information on how it thinks any given query is
best evaluated, and some tools to influence its optimiser.
Postgres allows user-defined functions to be specified in a number of
languages, including its own procedural version of SQL called PL/PGSQL,
and used easily in SQL queries.
MySQL
It was necessary to log in as a privileged user to start the server,
but, oddly, not to stop the server, nor start/stop the client.
Aborting a query did not seem to be possible without crashing the
client, but it was usually possible to re-start it without serious
ill-effects. The SQL implemented by MySQL is less complete and
less standard than for the other DBMS, though every release brings
improvements. Some sub-selects now seem to be supported.
There are various back-ends: the one I used was the default one which
does not fully support transactions with roll-back, since that was not
necessary for our requirements. Data types were a problem:
on Postgres and DB2 REAL is a 4-byte type, but FLOAT an 8-byte type;
with MySQL the reverse was true. I found that the only way to get
a 4-byte and 8-byte floating type for the same columns in MySQL and
Postgres was to use non-standard types called FLOAT4 and FLOAT8.
MySQL provides the elapsed time of each query by default. The
user-interface allowed the usual command-line editing and recall, and
use of external script files. MySQL also provides an EXPLAIN
command, although the output is not quite as easy to interpret.
There does not seem to be any way of influencing the query optimiser,
should the explanation be unsatisfactory.
MySQL only supports user-defined functions if written in C, and a
privileged account is then needed to link the compiled version with the
MySQL executable.
DB2
There are three different user-interfaces, all of them with their own
limitations:
- The db2 command
provides a simple way of entering SQL, here the lines need no semi-colon
at the end, so therefore long lines need a continuation marker. I
could not find this documented anywhere, but found by trial and error
that "\" worked. The db2 interface provided no execution or
elapsed time information. There was no command-line recall or
editing, no doubt designed for the user who always manages to get
multi-line SQL commands perfect every time: I wonder if such users exist?
- The db2batch command
seemed similar, but had the bonus of providing elapsed time
information. This could be augmented by cpu-time if db2batch was
started with the right command switch. The cpu-time was listed
anyway, but was always zero without this switch: why it was not the
default is hard to understand. The other switches for db2 and
db2batch were quite different for identical functionality, presumably
written by different teams who did not bother to talk to one
another. Db2batch required a semi-colon, but no continuation
marker. It also omitted to provide command-line recall and
editing. I would have used db2batch all the time, but it did not
support all SQL commands, only those in some (undocumented) core
set. The data loading commands had to be issued from db2, why I
cannot imagine.
- The db2cc command
invoked something called the "control centre" - a GUI which allowed
other GUIs to be started, including the "command centre". This
also allowed SQL commands to be entered. These also required a
semi-colon, so needed no continuation marker. This would have been
a good interface, had it not been programmed incompetently in
out-of-date Java. To my amazement I found that it did not support
cut/paste to/from other windows on my screen. I saw many critical
comments on the GUIs on various newsgroups, saying, essentially that
they were a waste of time. I came to agree with these comments.
Apart from grappling with these hostile user-interfaces, other problems
I had with DB2 include the following:
- There seems no way in a single SQL statement to select from one
table and have it generate a new table as the result. This is
possible in Postgres (using SELECT INTO table) and in MySQL (using
INSERT INTO table SELECT), but in DB2 it required two statements, one to
create a new table (with all the right column names and data types) and
then a second one to do the INSERT INTO. This is because, I guess,
the normal way SQL is used is just to produce text; the idea that one
might want to make a selection and then work on it further is alien to
the SQL way of thinking, though this is just what scientists want to do
all the time. The lack of such a facility in DB2 was a serious
drawback, not only in impeding data exploration, but because all my SQL
scripts had to be altered yet again.
- I could not find any way of stopping a run-away query.
Control/C and the like had no effect. If I killed the process it
seemed to prevent me using database and when loading it left it in an
inconsistent state. I generally had to stop and re-start
the server (using a different user-name, naturally). In one case
this did not work, and I had to re-boot the machine.
- The transaction log filled up unexpectedly and stopped
progress. I could not find out how to eliminate these logs nor how
to disable logging, but eventually discovered how to make more space for
these logs, and I increased it by a factor of one thousand. The
documentation suggests that several of these parameters are set by
default to values too small for modern computers.
There is no EXPLAIN command in DB2, but there is a separate product
called Visual Explain. I did not have time to try this out.
DB2 allows user-defined functions in several languages, including a
procedural version of SQL.
5. Data Loading
Postgres
Loading data was simple, especially since the 2MASS data files were
designed for use with Postgres. I saw in the manual that a data
load operation is performed within a single transaction, so it is all
rolled back if anything goes wrong, but I saw no errors. A typical
load command is:
COPY mytable FROM '/home/cgp/db/data2/psc_baa.csv' WITH DELIMITER '|' ;
Nulls were represented by \N in the 2MASS files, the Postgres
default, and I used this in the data files I prepared myself from the
raw USNO-B distribution.
MySQL
Loading non-spatial data was just as easy, since MySQL seems to use the
same defaults as Postgres. The load syntax is not something that
the designers of SQL have ever bothered to standardize, and MySQL
requires a slightly more verbose syntax:
LOAD DATA INFILE '/home/cgp/db/data2/psc_baa.csv' INTO TABLE mytable FIELDS TERMINATED BY "|" ;
DB2
Here the serious problems began. I searched the two volumes of
SQL commands for a long time without finding a data load command, and
the Redbooks were no help. I suppose DB2 is almost always used in
a transaction-processing environment where data is captured live, and
hardly ever needs to be loaded from external files. I finally
discovered a separate PDF manual called Data Movement Utilities Guide and Reference
a mere 429 pages long. It eventually yielded up the secrets of the
copy command. I then found that it could not be used from the
db2batch environment, or the db2 gui, only from the db2 command-line
interpreter. The typical syntax is:
LOAD FROM '/home/cgp/db/data2/psc_baa.csv' OF DEL MODIFIED BY COLDEL| INSERT INTO mytable ;
But data loads failed repeatedly. At first I thought that it was
the date/time fields which were causing problems: with 60 fields per
line and hopeless error messages, it took me some time to eliminate each
possibility. Tests with samples of a few hundred or thousand rows
sometimes worked fine. Eventually it seemed that the nulls were
not accepted. The manual suggested (without being definitive on
the subject) that there was no external representation of nulls, except
for the absence of data between two field delimiters.
Unfortunately every such null-by-default gave rise to a warning
message. I left one data load running when I went home at night;
the messages were still scrolling off my terminal screen at the rate of
hundreds per second when I arrived the following morning. I had to
abort this run, which left the database in a mess, and eventually I had
to reboot the machine. Since all the nulls were in numerical
columns (magnitudes) I found a work-around by using the Unix tr utility to convert each null
into the value of "-1" which was out of the normal range. By this
means I finally managed to load some strips of the 2MASS data. I
did not get around to converting these values back into nulls, though I
suppose that an UPDATE
statement would have been able to do that. Even so each load
operation generated lots of incomprehensible messages, of which this is
a typical example:
SQL3501W The table space(s) in which the table resides will not be placed in
backup pending state since forward recovery is disabled for the database.
Although I could not get timing information as with other DB2 commands
since the db2batch facility would not load data, the LOAD command
reported the clock times at the start and end, so the performance could
be measured. The only good point of DB2's data loading was that it
was fast, and I noticed that it used the cpu time of both
processors. The times taken to load 5.1 million rows of 2MASS were
as follows:
Postgres
|
MySQL
|
DB2
|
116 secs
|
205 secs
|
44 secs
|
Given that nulls are an essential element of nearly all astronomical
tables, and that finding a suitable out-of-band value to represent them
is a non-trivial undertaking, I think that if it is true that DB2 has no
satisfactory way of representing nulls, this omission forms a very
serious obstacle to its continued use. It's nice that DB2 is fast,
even better if it did the job we wanted it to do.
6. Spatial Indexing
Postgres
Loading spatial data from a text file required a slightly odd format
with extra parentheses, but I had no trouble in loading the required
spatial column, nor in writing a program to convert USNO-B data to the
required form. The basic spatial object is a rectangle, specified
by the coordinate pair (x,y) of two opposite corners, and one can create
an R-tree index on these bounding boxes. The SELECT statement can
use operators (such as &&) to test for overlap of the
rectangles. It was easy enough to read the rectangle coordinates
from an external text file, or to generate the rectangles on-line using
an UPDATE statement (which is obviously simpler). There is little
to report here: spatial indexing worked as expected, and was reasonably
fast.
MySQL
MySQL has only added spatial data support in its latest (alpha)
release, and it shows many omissions which will no doubt be rectified
over in future releases. The software is aimed at the GIS market,
and uses R-trees for indexing. The documentation was very sketchy,
and some trial and error was needed to discover what actually
worked. The first problem was to load the data, as no external
format for the spatial data type was documented. I asked in a
support newsgroup, but got no answer. I think this may be a
feature still missing. The only alternative was to use INSERT
statements to load the spatial data, but loading one row per INSERT
would have taken far too long. I found by experiment that I could
not load a large table in a single INSERT statement, but that 1000
rows/INSERT appeared to work. I therefore write a new utility to
convert the USNO and 2MASS data files into the appropriate SQL INSERT
statements. These were very
verbose, inserting a single row involves this:
insert into us values
(0.000286,0.07725278,4.06E-05,NULL,19.78,20.59,19.77,17.34,
polygonfromtext('polygon(( 0.00084 0.07781, -0.00027 0.07781, -0.00027 0.07670,
0.00084 0.07670, 0.00084 0.07781))'))
and so on, for another 3000 lines per INSERT statement. Note that
there are five coordinate pairs to define the box, the first and last
have to be identical. To judge from the disc space used, these
were all stored in the index, which was substantially larger than the
R-tree of Postgres. This may explain the lower speed.
Finding an intersection or overlap was also much more verbose, as there
were no operators handy, only functions. The one which I found to work
was MBRContains, as illustrated by this SQL sample:
select ura,udec,astext(ubox) from us where
MBRContains(GeomFromText('Polygon((123.45 0.05, 123.46 0.05, 123.46 1,
123.45 1, 123.45 0.05))'),ubox);
This verbosity might have been acceptable had the performance been
reasonable, but as noted below, the MySQL spatial index was many times
slower than that of Postgres.
DB2
DB2 spatial indexing comes in an additional package, called the Spatial
Extender. I downloaded it with difficulty (another 300 MB of
software and flaky FTP connection) but it eventually installed
correctly. But it would not work. I eventually discovered,
via the control centre, that it was not licenced. I asked the IBM
Scholars Program administration how I got a licence, and after a week,
got an answer. It turns out that the licence can be downloaded
from their web-site (indeed I had already done this) but it needs to be
copied to a new location and then a licence installation procedure is
necessary. Unfortunately by the time I got all the necessary
details, I had insufficient time to complete testing this feature.
The manual is fairly comprehensive, but at 574 pages, as intimidating as
the others.
One immediate difference is that IBM's Spatial Extender does not use
R-tree indexing, but a multi-level grid file. These are not
self-adjusting like B-trees and R-trees, but require the user to
consider the range and granularity of up to three levels of the
grid-file. Although the data are expected to be floating-point
values (geographical longitude and latitude are expected), they are
scaled and then converted to 32-bit integers. These scaling
factors also have to be chosen by the user.
Another problem then presented itself, just as for MySQL, the lack of
an external format for spatial data. I posted a question about
this to the dedicated newsgroup ibm.software.db2.udb.spatial, and got
useful replies, including this suggestion for populating the spatial
object from the (ra,dec,radius) triplet:
db2gse.st_polygon(
> 'polygon(('||char(x-radius)||' '||char(y-radius)||','||
> char(x-radius)||' '||char(y+radius)||','||
> char(x+radius)||' '||char(y+radius)||','||
> char(x+radius)||' '||char(y-radius)||','||
> char(x-radius)||' '||char(y-radius)||'))', 0)
In other words, one has to convert all the numerical values into
character strings because the polygon
function will only take string arguments. If such horrible
work-arounds are still necessary just to load spatial data, I dread to
think what hoops one has to jump through to use them. I discovered
that the Spatial Extender has a circle data type, and wondered how it
was implemented, and whether it might be suitable for our
error-circles. A respondent on the newsgroup, however, told me
that the circle was in general represented by a polygon with 97
sides. I did not think this would provide a very efficient
implementation.
I wonder if the multi-level grid-file in DB2 will survive much longer:
R-trees are much more flexible, and for our purposes much more
suitable. Other vendors who started out with other spatial index
structures, such as Oracle, have now switched to R-trees or variants of
them. IBM recently took over Informix, which has used
R-trees for several years.
7. Performance Comparisons
7.1 Joins using R-trees with Mysql and Postgres
This section will be updated later with DB2 details, assuming I can get
its spatial extender working.
The main test joined a section of 2MASS with an overlapping section
of USBN-B.
- 2mass: 5 million rows of data, file psc_baa covered 0 to 360 degs
RA, and 0.0 to 1.67 degs Dec, but with just these columns selected: RA,
DEC, position-error, three magnitudes, error-box.
- USNO-B: 3.6 million rows of data, covering 0 to 360 degs RA, 0.0
to 0.6 degs Dec., with these columns selected: RA, DEC, position-error,
five magnitudes, error-box.
The tests were carried out on the main node of hydra, which has
dual-Xeon 2.2 GHz processors, and 1 GB of memory.
| Elapsed times in seconds |
Mysql |
Postgres |
| Load 2mass, 5 million rows |
205 |
116 |
| Load USNOB, 2.6 million rows |
153 |
93 |
| Create spatial index on 2mass |
488 |
617 |
| ANALYZE (needed for Postgres) |
|
67 |
| Create index on USNOB |
347 |
442 |
| ANALYZE |
|
27 |
| Join tables on 1" circle |
4030 |
283 |
| total |
5223 |
1645 |
The speed advantage of Mysql at all the earlier stages was lost in
carrying out the join. The support for spatial data and R-trees has
only just been introduced, so it is perhaps not surprising that it has
not yet been optimised. On these figures, Postgres is the clear winner.
7.2 Simple Query Performance of Mysql and Postgres
It also seemed worth comparing the performance of Mysql and Postgres
with a more substantial chunk of data. These tests used ten files of
2mass data, totalling around 50 million rows.
| Operation |
Mysql |
Postgres |
DB2 |
| Load data |
1660 |
4030 |
2154
|
| Sequential scan for H_M between 12.010 and
12.011 |
192 |
264 |
1327
|
| Create B-tree on column H_M (+ANALYZE for
Postgres) |
2207 |
1265 |
1709
|
| Indexed access for H_M between 12.010 and
12.011 |
0.15 |
419 |
0.05
|
This Postgres result was so odd that it was repeated: subsequent
indexed selects took 3.9 seconds, and 0.7 seconds. Presumably some
information needed to be stored in the cache before the query was
executed efficiently.
7.3 Inner and Outer Join
The outer join (which copies unmatched rows from one table into the
results) is important in astronomy, since unmatched sources are often of
scientific interest. This test was only carried out using Postgres and
R-trees: the inner and outer joins took about the same times, within
limits of measurement.
7.4 Pixel-code versus R-tree
The pixel-code method is described in earlier documents: SkyIndexing
and SpatialIndexing
and was carried out using both Mysql and Postgres. The times for a
spatial join using R-trees was only performed using Postgres, since the
performance of Mysql was so poor.
| Elapsed times in seconds |
Pcode Mysql |
Pcode Postgres |
Pcode DB2 |
R-tree
Postgres
|
| Load psc_baa.tsv to tmp |
42.3 |
88.4 |
42.4
|
88.7
|
| Load psc_baa.pcode to tmpcode |
8.8 |
50.4 |
17.5
|
-
|
| Load b0900.tsv to usp |
4.5 |
20.3 |
4.5
|
12.1
|
| Load b0900.pcode to uspcode |
1.4 |
5.8 |
1.3
|
-
|
| Update tbox |
- |
- |
-
|
93.3
|
| Update ubox |
- |
- |
-
|
12.0
|
| Create index tcode on tmpcode(tcode) |
112.7 |
53.7 |
69.6
|
-
|
| Analyze tmpcode |
- |
5.5 |
-
|
-
|
| Create table pjoin2 using select distinct |
119.2 |
31.5 |
46.6
|
-
|
| Create index tmpind on tmp(tid) |
95.8 |
31.0 |
64.6
|
-
|
| Analyze tmp |
- |
1.6 |
- |
-
|
| Create index uspind on usp(uid) |
11.5 |
1.5 |
30.0
|
-
|
| Analyze usp |
- |
0.2 |
- |
-
|
| Create table pjoin from usp,pjoin2,tmp |
31.5 |
64.6 |
106.4
|
94.8
|
| total |
427.7 |
354.5 |
382.9 |
300.9
|
Conclusion: here as well Postgres is somewhat
faster than Mysql, and the R-tree indexing spatial join is faster than
the pixel-code method using the same DBMS. This is evident using
the rather small tables in this example, but the difference is larger as
the table size is increased. This table compares the two methods
using Postgres alone.
Number
of 0.1° strips of USNO-B
|
Pixel-code
algorithm
|
R-tree
Indexing
|
1
|
88 secs
|
54 secs
|
3
|
406 secs
|
161 secs
|
9
|
1150 secs
|
553 secs
|
27
|
4623 secs
|
1964 secs
|
8. Summary
The spatial join is clearly performed better by an R-tree index than
the pixel-code method; it has the advantages of speed, simplicity, and
flexibility.
The R-tree indexing in MySQL is still in its infancy, and not a serious
competitor to Postgres.
DB2 may have the required features in theory, but the practice leaves a
lot to be desired. Its roots in the mainframe world, when a huge
team of experienced administrators would be on hand, are still very
evident. It does not look suitable for use by an astronomer in a
typical data archive environment. The absence of a usable
representation for nulls in external text files is a serious
drawback. The spatial indexing looks complicated and hard to use,
compared to the R-tree indexing of Postgres (or even Informix).
Despite its potential advantages, I cannot recommend DB2 at present.
Postgres is not without faults, but it has a reasonable all-round
performance and all the essential features we need.
cgp 2003 Sept 29