r3 - 16 Aug 2002 - 13:17:56 - ClivePageYou are here: TWiki >  Astrogrid Web  >  DocStore > DataDocs > PostgreSQL

PostgreSQL and MySQL Evaluations

For convenience I am reporting here my detailed notes on both of these free DBMS, since there are many similarities, and point-by-point comparisons may be instructive.

Documentation

Both packages have on-line manuals which are almost adequate; Mysql has one good reference manual, Postgres has useful material in four volumes (Tutorial, User's Manual, Administrator's Guide, and Programmer's Guide). I supplemented them with a couple of books:

  • Core MySQL, by Leon Atkinson
  • MySQL Reference Manual, by Michael Widenius and David Axmark (the originators of the package)
  • PostgreSQL Developer's Handbook, by Ewald Geschwinde and Hans-Juergen Schoenig.

I found it very hard to get adequate information on the R-tree indexing and spatial data types of Postgres, as almost all of the books and on-line documents stop short of such advanced topics. I browsed newsgroups for some time, and eventually contacted an expert in Moscow, who was quite helpful, but he was actually working on extending the indexing to use Generalised Search Trees (GiST?), which seemed to me to be too close to the bleeding edge. It turned out that the BOX data type consisted of two pairs of (x,y) coordinates for opposite corners of a rectangular box, and that an R-tree can be built on a column of boxes. I worked out that the distortion of the (RA,DEC) coordinate scale at high declinations can be adequately represented by stretching the RA axis of these boxes, designed to enclose the error circle. It took me many attempts to get the ASCII input format for box data correct, but eventually I got it all working. The && operator is then available to test for box overlap in joins, and it uses the R-tree index as expected.

Installation

In both cases installation was easy: I found RPMs for Linux on the web, downloaded the latest version declared "stable" and installed (as root). I avoided grappling with the user account and password systems by making us of the pre-defined test databases in each case.

Postgres allowed me to choose the data storage area, but Mysql insisted on using /var which I accepted (since there was just enough space available). I discovered many user group messages asking how to change this: the simplest solution appears to be to install a soft link there to the directory or file-system you really want to use. This is not all that satisfactory.

Data Loading

Neither DBMS supported any (understandable) mechanism for loading binary data, so I converted the data source extracts (USNO-A2 and GSC) to plain ASCII text by writing suitable conversion programs. This allowed me to add the PCODE column computed from the postion, and insert small number of extra rows required by the PCODE method. The USNO table had some null values in the rmag and bmag columns. I found, with some searching of the documentation, that both DBMS would accept records separated by newline characters, and fields separated by TAB characters, and that \N was the default value for a null. Converting the table to ASCII text involves a significant programming effort, and one has to think carefully about the number of significant digits on all floating-point data. The rounding involved in this also means that exact comparison of selections on DBMS tables and the original FITS files will produce very slightly different results, which is unfortunate. Conversion to text also increases the data volume by a factor of 3 or 4, compared to a sensibly compact binary representation.

We discovered afterwards that, while the TAB and NL separated fields and rows were accepted by Oracle and MS SQL Server, they did not accept the \N for nulls. Clearly this is not an SQL Standard.

The loading commands were also very different:

Postgres LOAD DATA LOCAL INFILE 'filename' INTO TABLE table ;
Mysql COPY table FROM 'file' ;

Index Creation

R-tree creation in Postgres took 6400 seconds; doing this for the whole USNO-A2 catalog would have taken around two weeks. The B-tree generation on both was reasonably fast. With Postgres it is necessary to perform a VACUUM ANALYZE command after creating an index, or else the system refuses to use it. The time to do this is included in the reported figures.

With Postgres, once an index has been created, subsequent insertions become extraordinarily slow. It is nearly always better to drop all indices and later re-create.

Measuring Performance

I found it very hard to get timing figures for Postgres, and eventually wrote my own wrapper which read the system clock before and after submitting an SQL command. This also used the du utility to measure the change in disc space on the set of data directories, as mapping tables to files is complex and non-obvious. Finding how many rows had been selected by each SELECT was also difficult, and a subsequent SELECT COUNT(*) FROM table could take a long time.

Mysql, on the other hand, reported the time taken by each query, and the number of rows selected. It was easy to find the table size, since tables map to files. In addition the special query SELECT COUNT(*) FROM table is performed almost instantly from information stored internally.

SQL Differences

Postgres was case-sensitive in its column and table names, while Mysql was not. There were many other differences, the main ones noted below. Postgres claimed to conform, more or less, to the Entry level of SQL92, while Mysql admits to supporting much less than this.

The EXPLAIN command worked in both, but the results were completely different, and not easy to interpret. In principle the runes tell you whether your query will make use of an index or not.

Exporting tables

Both DBMS by default produced simple text files, and both had options to produce simple HTML. The tags used should make it easy to transform this into any desired XML format such as VOTable.

But, producing an ASCII file for export was difficult in both systems, essentially because they are client-server packages. The user talks to the client, but the server, which runs under some different user-id, has to write the data. I found it impossible to export data to my own home directory: it was necessary to write to some area that the server owned, and then copy and change permissions.

The syntax also was different for the rather common case of selecting/projecting information into a new table, rather than to ones screen:

Postgres SELECT * INTO newtable FROM table WHERE ... ;
Mysql CREATE TABLE newtable SELECT * FROM TABLE WHERE ... ;

Joins

The JOIN syntax was similar, but by default (SELECT * FROM) produced different results. A more serious problem is that if the two tables have names in common, the JOIN was only feasible if the SQL command qualified each name (usno.ra) and renamed the columns in the output to avoid such a clash. I started off with RA and DEC columns in both tables, but decided to rename them to simplify the SQL. Even so, the SQL isn't simple, here is one of my earlier (successful) join commands:

select distinct on (g.id)
g.id as gid, g.ra as gra, g.dec as gdec, g.pos_err as gerr,
g.mag as gmag, u.pcode as pcode,
u.id as uid, u.ra as ura, u.dec as udec, u.rmag as rmag, u.bmag as bmag,
round(degrees(7200*asin(sqrt((sin(0.5*(u.dec-g.dec)))^2 + cos(u.dec) *
 cos(g.dec) * (sin(0.5*(u.ra-g.ra))^2)))),2) as sep
into table myjoin
from gsc as g, usno as u
where g.pcode = u.pcode
and 2.0*asin(sqrt((sin(0.5*(u.dec-g.dec)))^2 + cos(u.dec) *
 cos(g.dec) * (sin(0.5*(u.ra-g.ra))^2))) < 5e-6+pos_err ;

I have my doubts that the average astronomer is going to have time to get to grips with all this. Of course, it should be possible to simplify it using pre-defined functions for things like great-circle distance. But not all the complexity can be removed.

Sub-queries

Postgres can do subqueries, for example to find outliers:

SELECT upcode,bmag INTO table2 FROM usno 
 WHERE bmag > (SELECT AVG(bmag)+3*(STTDEV(bmag))) ;

Mysql cannot do that, but needs to store the results of the first query in variables, which can then be used in the second:

SELECT @myavg := AVG(bmag), ~mysd := STDDEV(bmag);
CREATE TABLE table2 SELECT upcode,bmag FROM usno 
 WHERE bmag > @myavg + 3 * @mysd ;

Although SQL purists will no doubt disagree, I think the Mysql approach of dividing up a problem into modular units is more natural, certainly to those of us much more used to computing with procedural languages. What is sad is that (as far as I know) neither syntax actually conforms to SQL92.

Scalability

Both DBMS have measures to overcome the 2 GB limit of many file systems, allowing a single table to be split into a number of files. The Mysql solution has a Merge database type specifically to solve this, but with some restrictions.

Neither DBMS was designed to operate on distributed databases, or any specific support for multiple cpus. Mysql has built-in support for replication - a master-slave database can be set up, to improve access speed and robustness.

Preliminary Conclusions

On the whole I found it easier to use Mysql than Postgres, but both had their infuriating features. Mysql was clearly faster than Postgres on all tests, but only by 30% to 50% in many cases. I did not test the InnoDB back-end, which supports transactions fully. I have heard mixed reports on whether it is slower or faster than the default MyISAM back-end.

-- ClivePage - 2002 August 16.

Edit | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r3 < r2 < r1 | More topic actions
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback