Clive Page
2002 May 9
The most important feature of a Database Management System (DBMS) is, arguably, that you can find and extract the record you want in a dataset by specifying its contents, you do not need to know the record number, disc block, or the order in which the records were inserted. But finding the required record by sequential scan becomes very timeconsuming with large datasets, for example it may take a day or more to scan a terabyte of data. An efficient index, however, should be able to find any record in two or three disc seeks, say 30 ms. The art of using databases is largely the art of avoiding unnecessary I/O, since a disc seek takes typically 6 to 10ms, whereas memory can be accessed in less than 50ns, some 100,000 times faster.
In astronomy, the largest datasets are source catalogues, essentially lists of stars and other celestial objects giving their positions, magnitudes, and other properties. Several source catalogues in current use have over 500 million rows, and the largest ones will surely exceed a billion rows very soon. Collections of images and of raw data also occupy lots of disc space, but their indexing requirements are relatively modest: it typically takes only a few tens of thousands of images from a sky survey can cover the entire sky, so the index is quite small. This note will therefore concentrate on the indexing requirements of large source catalogues.
The example of a source catalogue which will be used in this document is the USNOA2 catalogue, not the largest source catalogue, but fairly typical of the larger ones. USNOA2 contains 526,280,881 rows, and effectively 6 columns. It was produced by the US Naval Observatory using a precision measuring machine to scan plates from the Palomar Observatory Sky and UK Schmidt sky surveys, with calibrations using the Tycho catalogue. It was distributed in a slightly compressed form on a set of 11 CDs.
Since a source catalogue is essentially just a big table, one might think that a relational DBMS, designed to handle tables, would provide all the necessary facilities. Unfortunately, the most common method of access to a source catalogue is by position on the sky, which for efficiency requires a twodimensional index. This brings with it the following main problems:
The most common search criterion astronomers use with a source catalogue is search for sources at or near a given celestial position (RA, DEC). Because coordinates are always a little imprecise and represented by floatingpoint numbers, one cannot normally search for an exact match, what one wants is all sources with positions lying within a small error region around the specified position. If, as is normally the case, the errors in RA and DEC are equal and independent, then the error region is actually an error circle. This type of search is often called search in a cone because if you think in three dimensions, you have a cone stretching from the eye (or telescope) out to the errorcircle on the imaginary celestial sphere.
Another common requirement is for the crossmatching of sources in one
catalogue with those in another, for example from a telescope in a different
waveband, or at a different epoch. Interpreting the results requires some
astronomical expertise, but the basic database operation is that of a JOIN
between two tables, based on the positions in the two tables being coincident
within the sum of the radii of the error circles (defined to some chosen
probability level). This operation has various names such as greatcircle
join (because one needs to use the greatcircle distance between positions,
the simple cartesian distance being incorrect away from the equator), but
the term fuzzy join is in common use, and seems simplest. Note that
this has nothing to do with dodgy concepts such as fuzzy logic.
The fuzzy join is of considerable importance because crossmatching is the
basic operation which underpins a whole range of more advanced data mining
operations, on which the new science of the Virtual Observatory depends.
But join operations in databases are intrinsically expensive as they require
a large number of search operations, potentially one per row of the first
table of the two tables. In many astronomical applications the fact that
a source in one catalogue has no counterpart in the other is also of interest:
this requires the left outer join between tables . This requires the
first table to be scanned sequentially, since the table produced will have
at least one row for each row in the first input table, and an indexed lookup
in the second table for each row in the first. This obviously requires an
efficient spatial index to be available for the second input table, and for
the join operation to be capable of using that index. Because this involves
the sequential scan of one table (usually the smaller) and indexed access
to the other, the fuzzy join is representative of many astronomical queries.
It is surprisingly hard to obtain even basic information on the spatial indexing features supported by commercial DBMS, such as the spatial algorithms they use and the data types and operators supported. It is almost as if they are ashamed at their offerings. I think the following details are accurate, but they are gleaned from Usenet newsgroups and the more remote corners of the websites of various vendors (sometimes from disparaging remarks made by their competitors).
IBM's DB2 has an optional Spatial Data Extender produced jointly with a company called ESRI, a supplier of Geographical Information Systems. This is an additional cost option to DB2 and is said to be based on a 3stage grid; whether this means it is based on the original gridfile of Nievergelt et al, I have been unable to discover. IBM's website says there is a white paper on the Spatial Data Extender, but the link was broken. I think I eventually found this paper, but it reveals practically nothing of its algorithm or properties. I do not know of any use in astronomical applications.
The Informix Universal Server has a Spatial Datablade Module, apparently
based on Rtrees. A notable feature of Informix Dynamic Server is the datablade,
a chunk of lowlevel code which can implement a userdefined data type, and
indexing functions on it. Sufficiently expert users can write their own datablades
(though I think the code has to be approved by Informix), but some standard
ones are provided.
A project to investigate Informix datablades as a way of handling astronomical
catalogues was undertaken jointly by the University of Maryland Baltimore
County and NASA Goddard Space Flight Center; they used the USNOA2 catalogue
(or rather a 1/3000th part of it) as their test dataset. Their experiences
are described in a collection
of papers. They compared the performance of a number of operations using
essentially four different options: simple Btrees on either RA or DEC, two
different spatial datablades, and a datablade of their own, called Simpleshape,
which used their own indexing using Rtrees.
Based on their small sample, their estimate of the time taken to load and
index the whole USNOA2 table are rather remarkable:
Schema  Load time (days)  Index time (days)  Table size (GB)  Index size (GB) 

Relational+Btree  3  6  13  56 
Geodetic blade  14  25  190  104 
Shape2 blade  10  15  140  77 
Simpleshape  1  1  15  20 
These times may appear excessive, but the estimated time taken to create a Btree index using the Geodetic datablade was no less than 94 days. The reduced load time of Simpleshape was because they developed their their own data loader (to overcome bugs and limitations of the Informix original) as well as their own indexing method. Their measurements of join included only a selfjoin of a table of 60,000 stars: this took 3072 seconds, i.e. around 20 joins per second. This does not seem especially fast to me.
The history of the product is complicated. The originator of Postgres,
Michael Stonebraker, took his product to market forming a company called
Illustra. This was taken over by Informix, but they apparently found
the product unsatisfactory in some way, and reimplemented the ideas in Informix
Dynamic Server (IDS), which included the datablade concept. Another version
of their product turned into Informix Extended Parallel Server (XPS). They
started project Arrowhead to reintegrate these two products, but this was
still not complete when the company was taken over by IBM. Reports
suggest that IBM expect Informix users to migrate gradually to DB2, but perhaps
only when the best features of both Informix servers are fully integrated
into DB2. This leaves the future of the Informix products somewhat
uncertain.
The Sloan digital sky survey query system, SX, was originally implemented
on Objectivity/DB, but has now been ported to SQL Server. The sky indexing
uses HTM, written by the astronomers involved. I do not know of any other
major astronomical application which has used this DBMS.
Mysql is an open source relational DBMS with a reputation for efficiency.
Its support for transactions has been criticised (though this is gradually
being improved) but for static datasets the product has a good reputation,
and is used by LEDAS and a number of other astronomical data archives. There
is no special support for spatial data.
Oracle8i has a Spatial Data Option: the only references on the open web that I could find suggested that it was based on something called the HHCODE (Helical Hyperspatial Code). It seems to use a spacefilling curve (one reference says a Peano curve) used to map two dimensions to one, so that a standard Btree can be used. The mapping technique is considered in detail section 6, where some fundamental limitations are noted. An Oracle white paper supplied by the company also says that an Rtree index can be used, and that it also supports geodetic coordinates (which seems to mean greatcircle distance over sphericalpolar coordinates). This suggests that we ought to include it in our evaluation. Although Oracle is used to host several astronomical data archives, I do not know of any using the Spatial Data Option.
The Department of Computer Science at Berkeley, which produced the original Ingres, went on under the leadership of Michael Stonebraker to produce an objectrelational DBMS called Postgres. Recent versions support SQL and are known as PostgreSQL. Although Postgres supports both Btree and Rtree indices, the latter are not well documented. Many books are available on Postgres, but few even mention Rtrees at all. The documentation which comes with the software is also distinctly sketchy in this area. As far as I can make out, the Rtrees do support queries using the cartesian distance operator <> (though whether this is the distance between rectangle centres or the nearest points is not clear), and also an boolean overlap operator &&. This is clearly something to evaluate.
Some tests of Rtrees were reported in the paper Objectrelational DBMS for Large Astronomical Catalogue Management by Baruffolo and Benacchio of Padova Observatory. They an astronomical catalogue of about one million rows as their test dataset. The search time for a Btree index was around 7 seconds, but less than one second with an Rtree. A later paper by the same team reports their switch to using Informix Dynamic Server. Meanwhile Postgres is said to have improved in efficiency.
One problem noted by Baruffolo et al is very slow data loading and index generation, even on a mere million rows. Of course the operations need, in principle, only to be performed once. Another area of concern is the disc space overheads.
These overheads can be estimated from published details of Postgres. Again, I shall use USNOA2 as the example. The Postgres Rtree is based on rectangular boxes (one specifies a coordinate pair for two opposite corners). To store this in a Postgres table with an Rtree index will expand the data from its original 6.3 GB to around 74.3 GB (based on reasonable assumptions on the packing efficiency of the Rtree nodes), an expansion factor of nearly twelve. Of course disc space is cheap, but the expansion factor also affects the data bandwidth when making retrievals from the database.
If the Rtree is not found to be satisfactory, then there is another possibility: the work of the Berkeley CS group has continued with Generalised Search Trees (GiST ). These have been interfaced to Postgres, but not yet fully integrated (according to some newsgroup postings). I have been in touch with Oleg Bartunov at the Sternberg Astronomical Institute in Moscow who has made a number of interesting contributions to the Postgres newsgroups on the subject. His code can be downloaded, but it is written in C with hardly any comments, and the only documentation was in Russian. The indexing problem he was trying to solve is in fact different from ours, but he thought that with some work it might well be possible to use a generalised search tree for sky indexing.
Sybase has a spatial option involving external software, SQS from Autometric, a division of Boeing Corporation, but I could not find much information on it. Sybase is used in a number of astronomical archive sites, but as far as I know, none of them use a spatial index.
It occurred to me that vendors of GIS might have already solved some of these problems of indexing the sky, since the surface of the earth is also roughly spherical. I therefore spent a couple of days trying to find out about GIS, but eventually concluded that these systems do not have much to offer the astronomer. There are a couple of reasons for this. firstly most GIS applications only cover a rather limited area of the Earth, so users simply adopt a map projection which removes the worst distortions; in the polar regions for example one would simply use an orthographic projection centred on the pole. And the polar regions are so sparsely populated that they are of little importance in most GIS projects. Secondly, most GIS systems are much more concerned about the representation of linear features such as roads, pipelines, and political boundaries, and in converting between vector and raster representations of these, whereas linear features barely figure in astronomy. While 2dimensional indexing seems to be a recognised problem in GIS, it did not seem to be of particular importance, and it was not clear to me that there was a standard solution. Most commercial GIS are built on top of some DBMS, and where spatial indexing is required, they just use the spatial data features of this DBMS.
To the best of my knowledge nearly all existing astronomical data archives make do with just a onedimensional index of the sky, on either RA or DEC. This is probably because twodimensional indexing is so rarely provided and difficult to use. Before moving on to twodimensional indexing, therefore, I want to consider indexing by just one coordinate, and its limitations.
Perhaps the first question to consider is whether to index on RA or DEC. One factor to take into account is that if sources are uniformly distributed on the sky they will also be uniformly distributed in RA, whereas in declination the distribution will be far from uniform because the area of sky per declination strip has a cosine function. This would seem to give an RA index better selectivity. A more important factor, however, is the wraparound at zero RA, which would require special code or duplication or rows to handle properly. The declination axis has no such problem, and for this reason, I think, most astronomical data archives and cataloguehandling systems have chosen to use a declination index.
To get efficient access to a onedimensional table there are three obvious methods:
Let us suppose we have a catalogue of celestial sources which includes columns called RA and DEC, with the coordinates given in degrees (radians might be more efficient, but the numbers in the examples would need more decimal places which makes them harder to follow, even though the RADIANS/DEGREES functions could be omitted).
This SQL query is designed to find all rows with positions within 0.01 degrees of RA 123.45 and DEC 45.67 in a table called CAT:
SELECT * FROM CAT WHERE DEGREES(ACOS(SIN(RADIANS(DEC)) * SIN(RADIANS(45.67)) + COS(RADIANS(DEC)) * COS(RADIANS(45.67)) * COS(RADIANS(RA123.45)))) < 0.01 ;
It is based on the simplest formula for the greatcircle distance between two points, where sphericalpolar coordinates are used. Unfortunately this expression is complex enough that it is certain to baffle all known query analysers, so that the DBMS will use a sequential scan of the data.
Perhaps I should point out that I realise that this simple formula is not very good if the angles are small (as they will often be here) because at least one cosine will be very close to unity and the resulting accuracy will be very poor. The next simplest formula without this problem involves haversines, which are not in Standard SQL. The haversine and its inverse could be provided as userdefined functions, but these are not Standard SQL either. There may be a portable solution to this, but the question is irrelevant, since it is clear that this type of expression is of no use anyway, if an indexed solution is required.
When an SQL query is sent to a DBMS, it is parsed, analysed, and processed by a query optimiser, before the necessary sequence of operations is scheduled. A complex SQL query can involve quite a number of operations: selection, joining, sorting, grouping, etc. The query optimiser tries to work out the fastest way of getting the result, which will often involve using an index, if a suitable one exists. But for small tables a sequential scan may actually be fastest, and it is also possible to submit a query containing expressions so complex that the optimiser is baffled into doing a sequential scan. DBMS are very clever pieces of software, but they are tuned to the types of query most commonly encountered in the commercial world, which is where nearly all their revenue comes from. The examples above illustrated some of the problems.
Next let us try an alternative approach, searching not in a circle but in a square box enclosing the circle of halfwidth (and halfheight) 0.01 degrees, which simplifies the syntax considerably:
SELECT * FROM CAT WHERE RA BETWEEN (123.450.01) AND (123.45+0.01) AND DEC BETWEEN (45.670.01) AND (45.67+0.01);
If an index is available for either RA or DEC then any competent query optimiser will use it to speed up a query of this form. If there are indices for both RA and DEC the optimiser has a choice. A good optimiser will use whichever seems likely to result in faster execution, by estimating the selectivity of each index from any information it has about the distribution of values in each column. Suppose it chooses the RA index, it will use this to select the subset of rows satisfying the condition on RA, and for each of these also evaluate the condition on DEC. But, because of the way that indices work, there is no way to utilise both of them at once Each index just returns a list of records selected from the base table. In principle, one might think, the combination of conditions could be handled by returning a set of row numbers from the application of each index separately, and then selecting only those rows which are present in both sets, that is: implement the AND condition by a setwise intersection. I do not know of any DBMS which does anything like this at present, although I suspect that search engines such as Google do it. But many of the details of commercial DBMS are kept secret, so I could be wrong about this.
The square box obviously returns more rows than required, because its area is larger than the circle. The easiest way to get SQL to do the full selection is first to use an indexed selection on one of the coordinates, and then apply the greatcircle formula to the resulting set of rows. Assume that we have an index on DEC, to avoid the complications of wraparound, and then use the greatcircle selection. Here is what result:
SELECT * FROM CAT WHERE (DEC BETWEEN (45.670.01) AND (45.67+0.01)) AND (DEGREES(ACOS(SIN(RADIANS(DEC)) * SIN(RADIANS(45.67)) + COS(RADIANS(DEC)) * COS(RADIANS(45.67)) * COS(RADIANS(RA123.45)))) < 0.01);
This, I think, selects the required rows, and is simple enough SQL that any reasonable DBMS will know that it can make use of an index on DEC.
The average astronomer, one might guess, will not be very happy if forced to write queries such as the one shown above. One simplification possible in many SQL systems to provide userdefined functions, so that one could use a greatcircle distance function, such as gtcirc(ra1,dec1,ra2,dec2), instead of the rather complicated expression given in the last example. But userdefined functions are not part of Standard SQL, and ways of defining them vary considerable in practice.
Even with such measures, SQL is not easy to use, and Jim Gray noted that astronomers using the SLOAN Sky Server used only the most basic types of query. I think that we may have to accept the necessity of designing our own astronomical query language, which can be preprocessed into the required SQL.
The SQL devised so far has, of course, ignored the questions of errors in the RA and DEC columns in the table. For some catalogues this error is small and much the same for all rows, so a fixed value may be appropriate, and even listed in the metadata. In other cases there is a positionerror column called something like PERR. This is typically a standard statistical error (sigma) and the prudent user will want to search out to a distance of perhaps 2 or 3 sigma to include all possible matches. If there is a separate position error for each row then it is quite hard to know how to deal with it, since we do not know without looking at each row whether to include it in any given search or not, so it somewhat defeats indexing. In practice, the errorradii do not usually cover a very wide range of values, and a pragmatic solution is to estimate, once and for all, the largest error radius of all rows in the entire table. This value can then be added to the error criterion (0.01 in the previous examples) used in each test.
Errors are sometimes elliptical or even more complex in shape: it is hard to cope with such cases in a simple way. The best solution is probably to use the radius of the exscribed circle, and later filter out the results which are not within the actual error region.
Using a onedimensional index is better than nothing, but to appreciate the gains from using a twodimensional index consider the following example.
Suppose we want to search the USNOA2 source catalogue for the counterpart of an Xray source which has a positional error of 3 arcseconds: the area of an error circle of this radius is 28.27 square arcseconds. The USNOA2 contains 525 million sources; if these were distributed randomly on the sky there would be one source per 1018 square arcseconds, so the probability of a source occurring at random in a circle of area 28.27 square arcseconds would be around 2.78 percent, on average. Thus in a randomlychosen circle of this size there will usually be no matching USNOA2 sources, but one will be present about 3 percent of the time, and more than one even more rarely. If we only have an index on one coordinate, let us say declination, the best that we can do is use it to select a strip of sky along the whole RA axis 6 arcseconds high: this has an area of up to 7,776,000 square arcseconds (depending on declination, it will be exactly that on the equator). A strip of that area will typically contain 7639 sources. We have to retrieve all these rows, and then use the greatcircle distance expression to scan through them all to find whether one or two (but most probably zero) actually lie within the specified circle. This would not necessarily need 7639 separate disc seeks, if the data in the RA strip were optimally striped on the disc, but the query would probably take a few seconds, and might take a bit longer. With a 2d index, however, there is no reason why the necessary row (or rows) could not be retrieved in under 30 milliseconds.
Obviously, a JOIN operation on a table of more than a few thousands rows will be exceedingly slow if only a onedimensional index is available.
Multidimensional indexing has been a hot topic in computer science departments of universities for many years (although the UK does not seem to have been very active in this field). Not only is the topic inherently challenging, because a good solution to the 1d problem was found so long ago in the form of the Btree, but there are many profitable areas of commerce where a good multidimensional indexing method would pay considerable rewards, for example mining and oil prospecting. Spatial problems arise in many other areas, of course, for example in microchip design, circuit board layout, and in the design of buildings and engineering structures of all types. An efficient multidimensional indexing system would also be of use in a wide range of queries on multiple parameters. Consider, for example, querying a vehicle database like this:
SELECT * FROM CARS WHERE COLOUR = "GREEN" AND MODEL = "ASTRA ESTATE" AND YEARLETTER = "G";
Or a star catalogue like this:
SELECT * FROM STARS WHERE SPTYPE = "G4" AND VMINUSB BETWEEN 0.1 AND 0.2 AND KMINUSL > 0.7;
Such queries could be speeded up considerably if a multidimensional index were set up on two or more of the fields in this selection. For all these reasons, the computer science literature abounds with multidimensional indexing algorithms.
There are in fact two distinct options for spatial indexing
The fact that there are so many different algorithms suggests to me immediately that the problem has not yet been cracked to general satisfaction. As far as I know, no multidimensional algorithm has yet been proposed which has all of the most desirable features of the Btree, i.e. fast insertion, deletion, lookup, and sequential retrieval, and a guarantee that it remains a balanced structure, and that all of its nodes below the root will be at least 50% full under all circumstances. .
The only algorithm which appears to have been sufficiently wellregarded to have been integrated into one or two DBMS is the Rtree, invented by Guttman (Proc. ACM SIGMOD International Conference on Management of Data, pages 47  57, 1984). Postgres is one of the few to incorporate the Rtree, and since it is also an open source product, it is high on our list of DBMS to evaluate. It could be that one of the more recent algorithms is better, but at present the Rtree is the market leader (in a very limited market)
The Rtree is defined not on points, but on rectangular regions of space.
At first sight this might seem like overkill, since many source catalogues
only have lists of what appear to be point sources. In fact, it should be
helpful, as it ought to solve the sphericalpolar distortion problem. When
loading the data into the Rtree it is necessary to compute the coordinates
(RA, DEC) of two opposite corners of the rectangle which encloses the errorcircle.
Allowance can obviously be made at this stage for the increased RArange
which an errorcircle occupies at declinations away from the equator. Indeed
if the spherical trigonometry is done correctly, it should be possible to
define the enclosing rectangle correctly even for a point exactly at the north
or south pole.
Since onedimensional Btrees are so efficient, and so well integrated into query optimisers, whereas indexing in two or more dimensions appears to be so difficult, other solutions to the problem have been sought. One, perhaps more of a fudge than a solution, is to devise a mapping from the twodimensional surface to a onedimensional quantity upon which a conventional Btree can be built. An important refinement is to fill the space with a spacefilling curve, so that points nearby in spatial terms are also nearby on the line. This means that a range query in space can be converted to a linear range, so permitting the use of an ordinary Btree.
Of course, as any topologist will tell you, a 2d to 1d mapping cannot be done without losing something. The question is, whether it can be done usefully at all.
One of the simplest mappings using a spacefilling curve, and one which seems to have been invented independently a number of times, is variously called the bitinterleaved index, or Zorder index. A simple way to visualise this in an astronomical application is as follows: take the RA and DEC coordinates, scale each to the range 0 to 65535, and convert to an integer. This scaling has been chosen to fit into a 16bit (short) integer, and gives a cell size which (at the celestial equator) is just under 20 arcseconds in RA by 10 arcseconds in declination. Next we interleave the bits, by taking one bit from each alternatively, to make a 32bit integer. Obviously the method can be extended to smaller cells on the sky by using 32bit coordinates, producing a Zorder index of 64bits, upon which an ordinary Btree index can be constructed.
An important feature is that in any small patch of sky, small offsets in RA or DEC will generally only change a few of the loworder bits in each coordinate, which will cause changes to no more than twice as many loworder bits of the Zorder index. So as a general rule, nearby points have similar Zorder values. This leads to the idea that one can search for all points in a given small patch of sky by specifying only the lowest and highest Zorder values in that region, that is one can convert a spatial range to a linear range in the Zorder index. To illustrate this simply, the table below shows a pair of 4bit coordinates interleaved to produce an 8bit Zorder index, with values from 0 to 255. Note that if you try to draw a line through the cells in sequential order, you get a repeated pattern of the letter Z, hence one of its names.
0  1  4  5  16  17  20  21  64  65  68  69  80  81  84  85 
2  3  6  7  18  19  22  23  66  67  70  71  82  83  86  87 

8  9  12  13  24  25  28  29  72  73  76  77  88  89  92  93 
10  11  14  15  26  27  30  31  74  75  78  79  90  91  94  95 
32  33  36  37  48  49  52  53  96  97  100  101  112  113  116  117 
34  35  38  39  50  51  54  55  98  99  102  103  114  115  118  119 
40  41  44  45  56  57  60  61  104  105  108  109  120  121  124  125 
42  43  46  47  58  59  62  63  106  107  110  111  122  123  126  127 
128  129  132  133  144  145  148  149  192  193  196  197  208  209  212  213 
130  131  134  135  146  147  150  151  194  195  198  199  210  211  214  215 
136  137  140  141  152  153  156  157  200  201  204  205  216  217  220  221 
138  139  142  143  154  155  158  159  202  203  206  207  218  219  222  223 
160  161  164  165  176  177  180  181  224  225  228  229  240  241  244  245 
162  163  166  167  178  179  182  183  226  227  230  231  242  243  246  247 
168  169  172  173  184  185  188  189  232  233  236  237  248  249  252  253 
170  171  174  175  186  187  190  191  234  235  238  239  250  251  254  255 
To see the locality of reference, a couple of regions of 3 by 3 cells have been shown in bold (and colour in the online version). Square regions are shown, which you can imagine might be an approximation to a small errorcircle. The 3x3 set (in green) around cell number 25 covers a range of Zorder values from 18 to 30. One could build a Btree on the Zorder values, and use a simple onedimensional range search to access cells in this patch of sky, except that it would return 13 cells when only 9 are needed, so a little subsequent weeding is necessary. This works in most parts of the plane, but unfortunately not everywhere.
The set of 3x3 cells (shown in yellow) around cell number 63 shows what can go wrong: here the range of Zorder values is 60 to 192, so that if you were to do a range search with these limits the resulting collection of cells would cover nearly half the total area, when only 9 cells are really wanted. The Zorder index, therefore, works reasonably well in most places, but goes spectacularly wrong in a few places when one of the highorder bits flips.
A 32bit index is just about good enough for current astronomical tables:
the cell size of around 200 square arcseconds has on average 0.2 sources
in a catalogue of the size of USNOA2, so the Zorder index of a row will
mostly be unique, but larger catalogues will need 64bit integers for their
Zorder values to be unique. Even if the index in unique, and the errorradius
much less than the cell size, there will always be some sources located close
to a cell boundary or corner, which gives them more than one Zorder value
per row.
It is easy to see that the average performance of the Zorder index in range searches may be rather poor. Even if just a small number of places in the sky are affected, if these cause a sequential scan of most of the table, they have a huge effect on the average. Estimating this average performance analytically was beyond my mathematical ability, so I did what all physicists do when the maths get too hard: cheat. I wrote a small program, a hundred or so lines of Fortran95, to set up a Monte Carlo simulation of the problem, with the added but necessary complication of sphericalpolar coordinates. Using 16bit coordinates for RA and DEC and thus a 32bit Zindex, which leads to a 20 arcsecond cell, my results were as follows:
Radius of errorcircle, arcseconds  Average No. of cells covered 

1  3,915 
3  10,135 
10  26,776 
30  115,718 
100  435,365 
This is a remarkably poor performance, indeed only for the very smallest search radii is the Zorder index better than using a simple onedimensional index on declination alone.
There are several other spacefilling curves with similar properties, the most wellknown perhaps being the Peano and Hilbert curves. Some of these have also been proposed as 2d to 1d mapping functions. I have not tried to simulate any of these, but all have the same basic drawback, of failing when a highorder bit flips in either coordinate. I would not expect any of them to have a much better performance.
Part of the problem is the distortion around the pole, which means that a small errorcircle intersects a lot of cells if these are set up on a simple grid (as in a Mercator projection). This has led astronomers to seek alternative mappings, appropriate for the surface of a sphere, with more nearly equal areas per cell all over the sky. The two best known are HTM and HEALPix.
HEALPix is a Hierarchical Equal Area isoLatitude Pixelisation of the sky invented by Krzysztof Górski and colleague. It is a development of the Quadrilateral Spherical Cube, devised for processing COBE data. HEALPix starts with a division of the sky into 12 segments, each of which is then subdivided into roughly square pixels; these can be subdivided into four smaller squares, and so on recursively. One important design feature of HEALPix is that the pixels are all of approximately equal area on the sky: this facilitates the use of 2dimensional Fourier transforms and similar techniques to investigate the largescale structure of the sky, which are not of interest here. There are two ways of numbering the pixels, the ring and the nested schemes. The nested numbering has the desirable property (for our mapping to one dimension) that adjacent pixels tend to have numbers differing only in loworder bits. A few special pixels have only seven nearest neighbours, all the others have the usual eight. The HEALPix package is written in Fortran90, but there is also an IDL interface. Current implementations appear to be limited to 32bit pixel numbers, giving a resolution of around 20 arcseconds.
The Hierarchical Triangular Mesh
(HTM) was devised by Peter Kunszt, Alex Szalay, and Ani Thakar at Johns Hopkins University for use in the Sloan Digital Sky Survey. One might think that dividing the sphere into squares was hard enough, but the idea of using triangles seems to have been proposed first by Paul Barrett (ADASSIV proceedings, pp492495) in what he called the Quaternary Triangular Mesh. The HTM tessellation works as follows: the sphere is first divided into eight equal spherical triangles with 90degree sides; from then on down each triangle is divided into four subtriangles of approximately equal area by bisecting all three sides and joining these midpoints with greatcircles. This sounds as if it involves a lot of messy trigonometry, but in fact can be carried out fairly rapidly if the coordinates are expressed in 3vector notation rather than in sphericalpolars. At each level you subdivide the triangle into four, so two additional bits suffices to name them. A 32bit integer provides an index with triangles around 20  30 arcseconds in size. Going to 25 levels in the structure fills a 64bit integer and gives a typical resolution of 9.6 milliarcseconds. A library of routines written in C is available. One important function converts a given position (RA,DEC) to the HTM pixel number to the desired level of precision, and other routines can list all the HTM pixel numbers covering a given small region of sky. There is also a Java version, but it is reported to use only 32bit HTM codes. In most cases the triangles in a given small region of sky have HTM pixel numbers which are similar, differing only in loworder bits.
The only comparison of HTM and HEALPix that I have been able to find is this one by Wim O'Mullane et al. It notes that in practice HEALPix appears to generate indices about an order of magnitude faster than HTM, perhaps because the HTM calculation is more complex and requires a recursive descent (or perhaps because HEALPix is written in Fortran90, whereas HTM uses C and Java). This speed advantage may not matter much in practice, since most applications will be limited by I/O transfer rates, not cpu speed.
I have not had time yet to use either HTM or HEALPix in my simulation program, but I would expect that while both of these mappings solve the problem associated with the singularities at the poles, they do not overcome the more fundamental limitation that when a highorder bit flips, the locality of reference is lost, and a Btree search would have to cover a large number of rows in linear space.
Mapping functions have no problem when mapping a point on a plane to one on a line, it is the idea that a spatial range can be mapped to a reasonably contiguous set of points on a line which is fundamentally flawed. If this concept is abandoned, then an alternative procedure becomes clear.
Provided that not too many error circles cross pixel boundaries, the table size does not increase much. For USNOA2, which has typical error radii under 0.3 arcseconds, the increase in the number of rows would only be 0.1%.
The only functions needed on the mapping are the following:
Both the HTM and HEALPix libraries contain functions which effectively do this.
The new table can then be used in a search in a cone operation as follows. Find call pcode_list to find the set of PCODE values, then search for all of them in the PCODE Btree index. There may be a few duplicates as a result, because the search cone covered two pixels both of which were covered by the errorcircle around the same source in the table, but it should be easy enough to remove duplicates by standard DBMS techniques such as SELECT DISTINCT(this this may require a unique sourcenumber column also to be present in the table).
It should be possible also to carry out a fuzzy join between two tables, if they have both been processed by the same procedure. This needs only a simple equijoin on their PCODE columns, an operation for which most DBMS are highly optimised. Again there may be duplicates in the output table caused by overlaps of errorcircles which both cover two (or even more) pixels, and these will need to be removed.
The obvious question is whether the PCODE method is sufficiently scalable. Clearly the pixel size of the grid needs to be reasonably well chosen so that not too many sources lie in a single pixel (or the weeding stage has more to do), but so that the pixels are mostly larger than the errorcircle radius. It is a wellestablished rule of thumb in survey radio astronomy that one does not try to extract from the data more than one source per forty beamareas (and there are similar principles in other wavebands using different terminology). This should ensure that there is always a pixelsize larger than the beam (which is normally larger than the errorcircle radius) but smaller than the typical spacing between sources. It is perhaps fortuitous that a 32bit integer PCODE results in pixels 20 to 30 arcseconds in size, which is quite suitable for the current state of astronomy. When we get tables of more than a few tens of billions of rows, there will be rather too many point sources per pixel, and then it will be time to add more bits. No doubt by then 64bit processing will have entered the mainstream of computing.
It is important to note that with both HTM and the HEALpix nested numbering scheme, the recursive division of the sky produces a cell number which can be truncated to produce a PCODE value suitable for a coarser resolution. This should make it possible to produce a pixel index for each table which is scalable to the resolution required for any given join situation.
The principal disadvantage of the method is that it requires a new version of each source catalogue to be produced, with an additional column for the PCODE, and space for additional rows. In processing which does not involve the PCODE column, these extra rows would have to be ignored.
The great advantage of the PCODE method is that it can be used with any commonorgarden relational DBMS, with no special support for spatial data or query processing. The equijoin on integer fields is just what they are all supposed to be good at.
The use of spacefilling curves to turn spatial ranges into linear ranges seems intrinsically flawed, and the simulations suggest the performance is totally inadequate.
The Rtree indexing of Postgres, Oracle, and Informix seems worth evaluating; given the uncertain future of Informix, we should perhaps concentrate on the other two. Figures reported for data loading and indexing time for both Postgres and Informix give some concern.
A sky grid based on the PCODE idea (using HTM or HEALPix or something simpler) could be used as outlined in section (7) above to provide an efficient search and join on tables. This would require an extra column for the index code, and additional rows in the tables, but these overheads are relatively modest. This option also needs to be evaluated with suitable test datasets, but essentially any relational DBMS can be used.
Note added 09 Sept 2003: further work on this topic is reported in a separate wiki document SpatialIndexing
 ClivePage  09 May 2002
Note on 3.1: the spatial data extender package is provided by ESRI at www.esri.com. Infomation on their SDE product  ArcSDE 8  can be found at http://www.esri.com/software/arcinfo/arcsde/index.html  this seems to allow te management of geographic information in either IBM DB2, Informix, SQL Server or Oracle.
 NicholasWalton  16 May 2002
Click here for the AstroGrid Service Web 
This is the AstroGrid Development Wiki 

