Using Postgres for Spatial Data

Introduction

Recently a few people have asked me how to use Postgres to handle spatial data. In fact most of the information is buried in various Postgres manuals or help files, but it is rather dispersed and hard to find. Perhaps also the manuals available from the Postgres web site are designed for the database expert, rather than the scientist. These notes are designed to fill that gap. They are naturally somewhat astronomy-specific, but may give some guidance to anyone thinking of using Postgres to handle some spatial data from some other field, for example geographical coordinates.

Postgres is an object-relational database management system (DBMS), which means it has the usual facilities of a relational DBMS plus the ability to handle structured (user-defined) data types. It is the only open-source DBMS with such advanced features. It is true that the lastest Alpha release of MySQL has added support for spatial data, but my tests showed that it was still rather basic and incomplete, and performed poorly when compared to Postgres. Postgres has had spatial data support for several years, with a range of spatial data types and operators, and provides R-tree indexing for efficient searching of spatial data.

Spatial Data Types

The geometric types built in to Postgres are listed in the User's Guide, section 5.7 (references here are to the documentation for Postgres version 7.3). They comprise: point, line, lseg, box, path, polygon, and circle. For my application I wanted to handle small regions on the sky, typically the small region around each measured position which allows for errors of measurement. The common term is "error box" but that does not imply that they are box shaped, often they are circles or ellipses. Initially I was not sure whether the circle or rectangular box would be most appropriate. The important thing is to make sure that the DBMS can index the data efficiently: the art of using databases is, very largely, the art of using indices.

In the Reference Manual (section I) under CREATE INDEX the notes say that the query optimiser will consider using an R-tree index when the indexed attribute is involved in a comparison using any of these operators:

            <<   &<   &>   >>   @   ~=   && 
The meaning of these operators is made clear in the tables in the User's Guide section 6.9. The only one of these operators which can be used on circles is @, and then only to find whether a point is within the circle or not; there does not seem to be any simple way of testing whether two circles overlap. In view of this, I chose instead to use the box data type, where the operator && can be used to test for overlap. A few experiments showed that this did indeed do what we wanted, and that R-tree indexing could be used to conduct efficient range searches and spatial joins.

To use the box datatype it is necessary to draw a rectangular box around each error region, whether it be circular, elliptical, or whatever. The database, with the assistance of R-tree indexing, can find overlapping boxes easily, and this does the bulk of the work. A bit of extra calculation is then necessary to determine which of these overlapping boxes actually involves an overlap of the circles (or whatever shapes) that they have been drawn around, but this only takes a little extra computer time.

Creating a Spatial Column in a Table

If you are creating a new table from scratch, then you can create it with one additional column to contain the error-box information. As a simple example, I will assume we want to store an astronomical source catalogue, where there are just four main columns for: Right Ascension (RA) and Declination of the source (for non-astronomers: these are the equivalent of longitude and latitude on the Earth), the radius of the error on this position (assumed to be circular for this simple example), and a source magnitude (or brightness). Most astronomical tables would be much wider, but the principles are the same. The metadata of this simple table might look like this

column name data type physical units Comments
ra double precision degrees Right ascension of star
decl double precision degrees Declination of star
poserr real degrees Position error on above coordinates (assumed equal)
vmag real magnitudes Visual magnitude of star
errbox box degrees Rectangular box around the error circle

Such a table can be created like this:

  CREATE TABLE mystars (ra DOUBLE PRECISION, decl DOUBLE PRECISION, poserr REAL, vmag REAL, errbox BOX) ;

Notes:

  1. Postgres is not case-sensitive, I am using upper-case in my examples to show the bits which are keywords in SQL, and lower-case for the names and other bits which I chose myself.
  2. The name of the declination column cannot be "dec" because that is a reserved word in SQL (or at least in Postgres) so I have called it "decl".
  3. It might be desirable to specify NOT NULL for some of these columns which cannot be null, but that's a refinement ignored here.
  4. The box data type actually holds four DOUBLE PRECISION values, the x and y coordinates of two opposite corners. In some cases REAL values would be accurate enough, but this is not currently supported for the box data type in Postgres.
  5. I have chosen to give the coordinates in degrees, which makes things simpler for users entering direct SQL. All the trig functions, however, require arguments in radians, so one has to make liberal use of the RADIANS() function to convert angles.

If you already have a table in existence, it is possible to just add a new column of the required type:

  ALTER TABLE mystars ADD COLUMN errbox BOX ;
This seems to be a fast operation; I suspect the hard work of re-structuring each row happens only when the new column is populated with data.

Populating a Spatial Column

You can load data into a table using either INSERT or COPY commands. If you just want to load the first four columns of each row, then a suitable INSERT statement might be:

  INSERT INTO mystars (ra, decl, poserr, vmag) VALUES (12.30, -4.56, 0.01, 12.9) ;
A single INSERT statement can insert values into many rows, the manuals explain the syntax. When loading bulk data, however, it is much faster to use the COPY command to load from a suitably formatted external file. Postgres has a defined binary format, but it is somewhat difficult to construct, and I always loaded text files, since they are easier to construct and to check. The default is for fields to be separated by tabs, but an alternative field separator can be specified. The default value for null fields is "\N" but again an alternative can be specified. The equivalent COPY command for a tab-separated text file would then be something like this:
  COPY mystars(ra, decl, poserr, mag) FROM '/home/cgp/data/mystars.tsv' ;
In both INSERT and COPY commands you do not need to list the column names if you are inserting or loading data into all columns.

There are two ways of filling the errbox column with the spatial data. If you are constructing the text file yourself you may wish to put the error-box data in it as well. There is a defined format for boxes, each corner is given as an (x,y) pair, with a comma between the two values, one for each opposite corner, and another pair of parentheses around the whole thing. To make this clearer in the example below I have used the vertical bar symbol as the field separator, then a line in the text file might look like this:

12.30|-4.56|0.01|12.9|((12.29,-4.57),(12.31,-4.55))
Note: here I have simply added/subtracted the error-radius, 0.01, to/from the RA value. To be correct we need to take into account the distortion of the scales at non-zero declinations.

Then the copy command would look like this:

COPY mystars FROM '/home/path/mystars.csv' WITH DELIMITER '|' ;

The alternative is to populate the errbox column with an UPDATE statement, calculating the box corners from the (ra,decl) value and the error-radius. This is slightly messy as one has to use the point and box functions, e.g. like this:

UPDATE mystars SET errbox = BOX(POINT(ra+poserr/COS(RADIANS(decl)),decl+poserr),
                                POINT(ra-poserr/COS(RADIANS(decl)),decl-poserr)) ;
Here I have made a crude allowance for the fact that the length of the box along the RA axis needs to be streched at declinations away from the equator.

Then what you get from a select statement is:

cgp=# select * from mystars;
  ra  | decl  | poserr | vmag |                                  errbox                                   
------+-------+--------+------+---------------------------------------------------------------------------
 12.3 | -4.56 |   0.01 | 12.9 | (12.3100317540404,-4.55000000022352),(12.2899682459596,-4.56999999977648)
(1 row)

Creating an Index on a Spatial Column

Creating an R-tree index is very simple, but Postgres has one quirk, after creating an index of any type it is necessary to run an ANALYZE operation or else the query optimiser will not make use of it under any circumstances.

CREATE INDEX myindex ON mystars USING RTREE(errbox) ;
ANALYZE mystars ;

Using a Spatial Column

To find all point contained within a given area, you need to construct a box around that area, and use the && operator in the selection. For example:

SELECT ra,decl,vmag FROM mystars WHERE errbox && BOX(POINT(179.0,1.1),POINT(179.1,1.2)) ;

In a similar way you can do a spatial join between two tables which have compatible columns of spatial type. It is only necessary to have an R-tree index on one of the tables, preferably the larger one. Suppose you have two tables called mystars and usno, the join syntax might be something like this:

SELECT mystars.ra,mystars.decl,usno.ra,usno.decl FROM mystars, usno WHERE
 usno.errbox && mystars.errbox ;
If the columns have unambiguous names you do not need to precede them with the table name. This selection will return rows where the rectangular boxes overlap; if you want only those where the circles overlap, within limits of the two error-radiii, then a bit more trig is needed. In the case of astronomical or geographical coordinates you need to compute the great-circle distance, not the simple cartesian one. Unfortunately the simplest expression for great-circle distance does not work at all well for small distances; a better one is that based on the haversine formula. This is used here:

SELECT mystars.ra,mystars.decl,usno.ra,usno.decl FROM mystars, usno WHERE
 usno.errbox && mystars.errbox AND DEGREES(ASIN(SQRT(POW(SIN(RADIANS(mystars.decl-usno.decl)/2)
 + COS(RADIANS(mystars.decl)) * COS(RADIANS(usno.decl)) * POW(SIN(RADIANS(mystars.ra-usno.ra)/2)))) 
 < mystars.poserr + usno.poserr ;

Further Information

The Berkeley Computer Science department, where Postgres originated, have done much work on generalised search trees, and this may be found at the GiST site.

A group of people at the Sternberg Astronomical Institute of Moscow State University are developing a package for Postgres called PgSphere, which will support R-tree indexing using spherical-polar coordinates. At the time of writing the software has not been released, but draft documentation is available from the pgastro web-site.


Update: Disc Space Usage

I guess it's a good thing that we ran out of disc space on hydra (and on cass123 nearly) as it made me investigate, belatedly, at the disc space used by Postgres for its tables. For a start there's an overhead per row of about 40 bytes, although this depends on the structure of the table. It is used to store pointers, null flags, and other stuff.

More seriously, if you modify a row, it gets copied to a new location on disc, and the old version just marked as inactive. This means that the UPDATE statements which populate the errbox column will cause every row to be duplicated. There seems to be two ways of recovering this space

  • Carrying out a VACUUM FULL operation: this actually releases space, whereas a VACUUM alone moves it to the end, where it could be used only by new records in the same table.
  • Using a CLUSTER command, which creates a new table (and index) and deletes and renames the old one.
If the initial table data used to create the table do not contain errbox values, the next steps are to use an ALTER TABLE ADD COLUMN option, then an UPDATE to populate the column. Then either
  • Use CREATE INDEX to create the R-tree, and then CLUSTER on that index to produce a table copy.
or
  • Use VACUUM FULL to remove the obsolete rows, then CREATE INDEX on the clean version, then optionally CLUSTER on the result.
It is not clear yet which route is better. The manual suggests that CLUSTER may be faster than VACUUM FULL for large tables.

What is clear is that it is better to load the table with a ready-populated errbox column, as populating this afterwards causes a lot of disc space to be wasted, and recovering it is tedious.


Loading 2MASS into Postgres.

This section lists all the steps needed to load the 2MASS catalogue from data distributed by IPAC at Caltech into Postgres, and set up a spatial index.

(1) Create a Postgres database

This uses the createdb command - there are lots of options, in particular you can specify here where you want the data to go, if you didn't do this when you used the initdb command. These examples assume that you want the database to be called "mydb". From your shell:

  > creatdb mydb

(2) Create the table

These examples assume that the table is to be called "twomass", since a table name cannot (I think) start with a digit.

  > psql mydb
Welcome to psql 7.3.4, the PostgreSQL interactive terminal.

  Type:  \copyright for distribution terms
         \h for help with SQL commands
         \? for help on internal slash commands
         \g or terminate with semicolon to execute query
         \q to quit

  mydb=# 
At the psql command prompt (which reminds you of your database name) you can enter the following. It is sensible, of course, to have the CREATE TABLE command in a text file, which can be loaded with just \i filename

  CREATE TABLE twomass (
    ra double precision,
    decl double precision,
    err_maj real,
    err_min real,
    err_ang smallint,
    designation character(17),
    j_m real,
    j_cmsig real,
    j_msigcom real,
    j_snr real,
    h_m real,
    h_cmsig real,
    h_msigcom real,
    h_snr real,
    k_m real,
    k_cmsig real,
    k_msigcom real,
    k_snr real,
    ph_qual character(3),
    rd_flg character(3),
    bl_flg character(3),
    cc_flg character(3),    
    ndet character(6),
    prox real,
    pxpa smallint,
    pxcntr integer,
    gal_contam smallint,
    mp_flg smallint,
    pts_key integer,
    hemis character(1),
    date date,
    scan smallint,
    glon real,
    glat real,
    x_scan real,
    jdate double precision,
    j_psfchi real,
    h_psfchi real,
    k_psfchi real,
    j_m_stdap real,
    j_msig_stdap real,    
    h_m_stdap real,
    h_msig_stdap real,
    k_m_stdap real,
    k_msig_stdap real,
    dist_edge_ns integer,
    dist_edge_ew integer,
    dist_edge_flg character(2),
    dup_src smallint,
    use_src smallint,
    a character(1),
    dist_opt real,
    phi_opt smallint,
    b_m_opt real,
    vr_m_opt real,
    nopt_mchs smallint,
    ext_key integer,
    scan_key integer,
    coadd_key integer,
    coadd smallint
  ) WITHOUT OIDS;
The "WITHOUT OIDS" option makes for a slightly more compact table, I think, and has no ill-effects that I can discern.

These column names are those listed in the files which come with the 2MASS distribution from USNO. The meaning and units can be found from the 2mass web-site. Another useful source of information is [[http://vizier.u-strasbg.fr][Vizier] where the ReadMe file (see link in the coloured bar near the top) gives details. If you check the UCD box, you can see the Unified Content Descriptor (UCD) of each column too.

(3) Load the data.

This can be done, in theory, all in one command from the shell, which unzips the files distributed by USNO and executes LOAD commands within Postgres. The command is:
  > zcat psc*.gz | psql -c "copy twomass from stdin with delimiter '|'" mydb

(4) Add an error-box column and populate it

All the following operations are done within psql, but I have not bothered to reproduce the command-prompt. The first step is to add a column of type box to contain the rectangular error-box, this is a very fast operation, the records are only modified when they are actually populated with data:
  ALTER TABLE twomass ADD COLUMN errbox BOX ;

  UPDATE twomass SET errbox =
   BOX(POINT(ra + err_maj/(3600.0*COS(RADIANS(decl))),
           decl + err_maj/3600.0),
       POINT(ra - err_maj/(3600.0*COS(RADIANS(decl))),
           decl - err_maj/3600.0)) ;
If you think these limits, set by the err_maj column, are too conservative, then modify the formula accordingly. It is sensible to have the minimum bounding rectangle in the errbox to be larger than any likely search radius that you want to use, and efficiency does not suffer all that much if it is a bit larger than necessary (up to the point that significant numbers of these boxes overlap).

(5) Create the spatial index

The ALTER TABLE command below is only needed if you want to use a clustered index: this improves performance by grouping the data so that values nearby in index terms (which means nearby on the sky) go into adjacent disc blocks as far as possible. Because of some limitation in Postgres, you can only have a clustered index on a column which is declared to be null-free. The errbox column is, but Postgres does not know that, and indeed if you use this command it will check, and take some time over it. If you don't want a clustered index, go right on to the CREATE TABLE command, and ignore the CLUSTER command which follows it.

  ALTER TABLE twomass ALTER COLUMN errbox SET NOT NULL;   -- only needed for clustered index

  CREATE INDEX tbox ON twomass USING RTREE(errbox);

  CLUSTER tbox ON twomass;       -- only needed to cluster the index

  ANALYZE VERBOSE twomass;       
The ANALYZE command is needed after any index creation, or any significant change to one. The VERBOSE is optional, but gives some information on the size of the index.

(6) Test the index

A suitable command might be:

  SELECT COUNT(*) FROM twomass WHERE BOX(POINT(123.4,45),POINT(123.9,46)) && errbox ;

(7) Optional: Create a declination index

You may want to create also an index on just declination to see how much slower it is, e.g.
  CREATE INDEX tdec ON twomass(decl);
If you do this after the ANALYZE command, then you will need to run ANALYZE again. The advantage of a dec index is that it supports simpler queries like this:
  SELECT COUNT(*) FROM twomass WHERE ra BETWEEN 123.4 AND 123.9 AND decl BETWEEN 45 AND 46;
But these will execute more slowly. It is instructive to compare the performance of these indices: the dec index always performs at much the same speed, as it has to physically retrieve lots of rows from the table, and then throw most of them away when it later filters on RA. The spatial index gets faster if it is re-run, as most of the necessary rows are already in the disc cache.

Acknowlegement

This publication makes use of data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation."

-- ClivePage - 10 Feb 2004.

Topic revision: r5 - 2004-02-10 - 16:56:30 - ClivePage
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback