CrossMatchingReport

Data Federation and Data Mining

1. Project Objectives

The objectives of the AstroGrid project were summarised in our original grant application as follows:
  • A working datagrid for key UK databases
  • High throughput data mining facilities for interrogating those databases
  • A uniform archive query and data mining software interface
  • The ability to browse simultaneously multiple datasets
  • A set of tools for integrated on-line analysis of extracted data
  • A set of tools for on-line database analysis and exploration
  • A facility for users to upload code to run their own algorithms on the data mining machines
  • An exploration of techniques for open-ended resource discovery
The importance attached to "data mining" is clear from the fact that it appears three times in these eight points.  Related operations such as browsing, exploring, and interrogating databases are also given prominence.  More recently the term "data federation" has come to the fore. This document tries to explain these terms, and report on the results of our recent research programme.

2. Data Mining and Knowledge Discovery from Databases

Data mining is generally defined as the discovery of information buried in massive databases.  One can mine images, for example, by developing more sensitive algorithms to detect previously hidden sources. These are interesting research projects, but have no particularly strong e-science component, since they mainly involve the development of specialist software for use on a single computer. 

We are more concerned with mining data from source catalogues, which are lists of sources of radiation (such as stars and galaxies) detected in images of the sky. In recent years there has been a explosive growth in thir size; this started some years ago with the digitisation of archives of photographic plates using high-resolution scanners, and has accelerated with the use of ever larger CCD detectors on modern telescopes. The resulting datasets are tabular in form: the longest have already reached a billion rows, and the widest tables have around 400 columns. Even larger datasets will be generated by telescopes coming on stream shortly, especially those from systematic sky surveys such as WFCAM and VISTA.  These source catalogues are held on servers at data archive sites around the world, mostly making use of relational database management systems (RDBMS), but some using astronomer-written software.  In principle many of the data mining functions that we need are provided by standard RDBMS.  This branch of data mining is sometimes called Knowledge Discovery from Databases (KDD) but this is an awkward phrase (or acronym) and not used very much in scientific contexts. 

The technology of relational database management systems (RDBMS) is mature and in the commercial world such systems routinely handle huge datasets.  In the 2001 VLDB survey, according to www.wintercorp.com the largest commercial databases had 250 billion rows and used 10 TB of disc.  This is one to two orders of magnitude beyond the largest astronomical tables at present, although we this gap may be closing. It is notable that the DBMS supporting the three largest transactional databases listed was IBM's DB2; for decision support a variety of different products was listed, including Oracle and Informix. The latter is now also an IBM product, but it competes directly with DB2 so its future is somewhat uncertain. Academic and research sites in the UK can get free licences for DB2 under IBM's Scholar's Program. We also wanted to assess two open source DBMS are widely used in the astronomical community: MySQL and Postgres. All three products now support the indexing of spatial data. Besides these general purpose DBMS, there are several products on the market specifically designed for data mining in the commercial world.  We investigated several of these briefly, but found that most are designed to mine records from transactional systems, and have few functions useful in handling scientific data.  

3. Cross-matching and Data Federation

Many useful investigations can be carried out on single catalogues, although the teams producing each one are likely to have a head start on such investigations. Results of much greater scientific value are likely to arise from the combination of two or more catalogues, by cross-matching them, that is identifying the same sources of radiation in different catalogues. If the two catalogues cover different wavebands then you immediately get additional spectral information; if they cover different epochs then you may get useful information on source variability, or on proper motion or orbital motion.

In database terms, cross-matching (or cross-identification) involves a sequence of spatial searches or a spatial join between two (or more) tables. The spatial join, therefore, represents an essential basic operation in astronomical KDD. Unfortunately commercial DBMS are not generally designed to do spatial joins. Some of them provide (or can have added) a spatial indexing system; an alternative is the pixel-code (or pcode) algorithm which I described in an earlier report on SkyIndexing which uses a 2-d to 1-d mapping function and should work on any common-or-garden RDBMS.

Ideally one would like to perform these spatial joins over a federated database system, that is by making the tables concerned look as if they were part of a single unified system, while actually being resident at different sites. The difficulties of doing this are discussed in more detail below. An obvious alternative is to set up an astronomical data exploration facility (ADEF) or astronomical data warehouse: a repository for all the tables which are needed in a given data mining investigation.

4. DBMS Investigations

The work carried out during the period 2003 July to September was to study two main questions:

  1. Which of the two algorithms for carrying out spatial joins works best on large datasets
    • join based on a spatial index such as an R-tree
    • the pixel-code algorithm.
  2. Which DBMS performs best on real astronomical data. The short-list from our earlier investigations was:
    • Postgres: an open-source object-relational DBMS with spatial indexing based on R-trees and a reputation for good scalability and robustness. 
    • MySQL: a light-weight open-source relational DBMS, with a reputation for fast querying. R-tree indexing has been introduced in the latest alpha-release. 
    • DB2 from IBM: an "enterprise level" commercial object-relational DBMS, which can be used free for academic and research purposes. It supports spatial indexing but does this with a multi-stage gridfile, not an R-tree. DB2 supports advanced facilities include federated and distributed databases, and can utilise parallel hardware.
At present MySQL is probably the DBMS most widely used by astronomers, but Postgres appears to be gaining in popularity, especially by those handling large datasets. Although DB2 has a large commercial base, I do not know of any use for astronomical data.  

It is important to note that while the DBMS is likely to be an important element in a data mining system, other packages will also be required.  Factors of interest therefore include:
  • Ease of importing data - because all DBMS have their own (often unpublished) internal formats, and none of them can handle tabular data formats used by astronomers such as FITS tables or VOTable, so some data conversions are inevitable.
  • Ease of exporting data - to statistical or visualisation packages, since DBMS have a limited repertoire of statistical functions, and not even basic graphical facilities.
  • Quality of documentation and software support
  • Ease of installation, configuration and use of the package are all important, as few data centres have teams of database administrators or experts on hand.
  • Conformance to existing standards
  • Performance not only on spatial searches and joins, but also on operations involving data loading and sequential scanning, since not all ad-hoc queries will be able to take advantage of indices.

It should be noted that AstroGrid's remit is essentially that of providing infrastructure for data mining and the handling of large datasets. This has been interpreted to mean software rather than hardware or network links, which are the responsibility of other bodies. We expect that the various astronomical data archive centres will continue to make their datasets available, and that many of them will also chose to provide data mining facilities, at least for their local users, and perhaps open to everyone. The main role of AstroGrid, therefore, is to recommend the software packages that such data centres should install, and to develop additional software where necessary. Such software will be freely available. I expect that additional data mining centres will be set up to support particular groups of users; AstroGrid will also want to set up one or more data mining centres in order to test and demonstrate its software.

5. Cone Search

5.1 Scientific Requirements

The term "cone search" is a short-hand term for the searching one or more source catalogues for sources near a given celestial position, Measured positions always have errors, usually but not always about the same in both coordinates, so one needs to specify not only the position of interest (RA, DEC) but also a small search RADIUS. This effectively searches a cone in space extending from the observer out to the edge of the observable universe, or infinity, whichever comes first. There is one minor complication: the most commonly used astronomical coordinate system is locked to the Earth's orbit, which precesses slowly, so a complete positional specification should include the coordinate system equinox: common choices are B1950 or J2000. Many services allow the user to choose the desired equinox, and do the relevant coordinate conversions on the fly. Another valuable feature is that some services allow you to search a small radius around a named source, and will use an astronomical name resolver such as SIMBAD or NED to translate the name to a pair of celestial coordinates.

The principal criterion by which two source records can be matched is the approximate position, but it is not always sufficient. For example, in the case of X-ray source positions which have fairly large error-circles and usually rather faint optical counterparts, one may find several optical sources within the error-circle. Other parameters, typically spectral information, are then needed to resolve the ambiguity. This may be an iterative process: after a reasonable number of sources has been successfully matched, one gets an idea of the properties of a particular class of sources, perhaps making it easier to resolve ambiguities in the remainder.

Some accurate positional work might need to take into account proper motions of stars in space: a number of catalogues, including USNO-B, contain proper motion data on many of the nearby stars, so that by specifying the date of interest, one can precess the catalogue positions to the required epoch. Vizier is able to do this, when the information is available in the right form.

The cone search is just about the simplest operation that we need to support, indeed at first sight it may appear trivial, but some of the issues are worth exploring as they become more acute in case of cross-matching lists of sources.

5.2 Existing Services

The SIMBAD service of CDS was established over 20 years ago but supported from the outset a form of cone search for bibliographical information. With the arrival of the Internet, many observational archives came on-line, usually allowing one to log in to a captive account via telnet and perform cone searches on their data holdings, often including relevant source catalogues. Early examples included LEDAS at Leicester and HEASARC at NASA/GSFC. As soon as the world wide web was invented, these services and others quickly adopted interfaces driven by HTML forms, with CGI connecting to a back-end running a database management system of some sort. These services all required the same three primary parameters for a cone search: RA, DEC, and RADIUS, but there were many differences of detail. Angles might have to be given in decimal degrees or sexagesimal form, the delimiters might be spaces, commas, or colons, and the radius might have units of degrees, arc-minutes, or arc-seconds.

The first attempt to bring order to the chaos was the Astronomical Server URL specification (ASU) published by CDS in 1966 - for details look at http://vizier.u-strasbg.fr/doc/asu.html. CDS also devised the GLU software package which could be used to translate a standard cone-search query into the syntax, coordinates, and equinox required by each of a number of well-known data archive centres. This made it possible to send a single query to a number of servers: the Astrobrowse facility of HEASARC and GSFC is the best-known service making use of this ad-hoc standard, which many services have now adopted. Using Astrobrowse, one can issue one cone search query, and wait for responses from a dozen or more archive sites, which are queried in parallel.

More recently the NVO project has released a definition for a "Simple Cone Search", to which a number of US-based archive sites now conform. This specifies a Web Services interface, with data returned as a MIME-encapsulated VOTable. A description and registry can be found at http://us-vo.org/metadata/conesearch/. Unfortunately, and for reasons which remain obscure, the main query parameters are incompatible with the existing ASU specification, and it omits the equinox parameter, which forces everyone to assume J2000.

It is also worth noting the Vizier service of CDS: this contains something like 10,000 tables, many of them source catalogues. A single cone search can be specified to be performed on one or many of these catalogues.

A major limitation of all current services is that there is no attempt to merge or combine the responses which appear as separate tables. This issue is covered in a separate report DataFedeReport .

5.3 Implementation

For small tables efficiency is not important, since a response within a few seconds is probably adequate, so a brute-force search of the entire table is a feasible implementation. For tables of more than a few thousand rows, however, some index assistance is desirable. Since one is doing a spatial search, a spatial or two-dimensional index is desirable, but support for a spatial index is still uncommon in mainstream DBMS. Many services, however, use no professional DBMS at all but rely on astronomer-written software. Perhaps the most venerable program still for this purpose in use is the BROWSE program written around 1980 in support of ESA's database of results from the EXOSAT observatory. It is still in use at HEASARC and LEDAS and maybe other sites, but likely to be retired fairly soon.

WCSTOOLS, a package from the Harvard-Smithsonian Center for Astrophysics, is also used by many archive sites. This is a wonderfully pragmatic piece of software which supports access to a number of well-known source catalogues in their original distribution format (or at least an uncompressed version of this): supported formats include those of USNO, GSC-II, 2MASS, Tycho, ACT, SAO, IRAS, and PPM catalogues. Essentially the WCSTOOLS scat program has a back-end matching the schema and structure of each of these, as well as its own (even more gloriously ad-hoc) formats for textual and binary data. The only accelerated access method used in WCSTOOLS is slicing and sorting. For example 2MASS is sliced into separate files each covering a strip of declination 1.67° high, and USNO-B1.0 is sliced into strips 0.1° high. Within each strip the rows are sorted by Right Ascension. This simple structure permits a surprisingly rapid access.

Although sorting works well for small tables, it does not scale very well. Access to a given row in a table of N rows takes typically log2N disc seeks, whereas using a B-tree holding say 100 rows per bucket will only require log100N seeks: for a table of a billion rows this the B-tree is likely to be around six times as fast. Sorting is only applicable to read-only tables, perhaps why no professional DBMS supports it, and of course only works in one dimension.

MySQL, Sybase, and SQL Server appear to be the most popular DBMS used for astronomical data archives, but as far as I know all sites using them index their source catalogues along one coordinate axis only. An index on RA is generally more efficient, as sources in the sky will have a more uniform distribution in RA than in DEC. On the other hand the wrap-around of values at the zero-point of the RA scale is a serious complication. Many services have therefore chosen to use their single index along the declination axis. This is obviously not very efficient, as the index is effectively selecting not a small patch of sky but a strip around the entire 360° strip of RA.
The effect in practice is shown by the following table of results, obtained on the whole 2MASS catalogue of 470,992,970 rows and 60 columns, ingested into Postgres. Cone searches covering an area of 0.01 and 2.0 square degrees were carried out, when the table was accessed by either a B-tree on declination, or an R-tree on both RA and declination. The elapsed times are in seconds.


Indexing method for cone search 0.01 square degs 2.0 square degrees
Declination index, B-tree 156.7 secs 1639.7 secs
RA+declination index, R-tree 1.2 secs 2.2 secs

The superiority of the R-tree is overwhelming on a table of this size. The R-tree access times reduce further, to a millisecond or so, when the same (or a similar) query is repeated, as the required index blocks are cached in memory, whereas repeated B-tree accesses require scanning rather a substantial fraction of the original dataset, and these hardly speed up at all.

A spatial index is not only beneficial in improving performance, but also provides a powerful tool which would allow astronomers to select sources according to the projected sky density in the area of interest. This may be necessary to find the probability of false-alarm in identifying sources from one table to another. It also allows one to answer questions like: "give me a list of quasars located in areas where their density is larger than 15 objects per square degree".

5.4 Query Language

All modern DBMS speak nothing but SQL, in theory a standard query language, but in practice a collection of dialects. The SQL needed for a cone search also varies depending on whether a particular DBMS uses simple B-tree indexing, or an R-tree. In the latter case there are no widely accepted standards for the notation. In both cases the first step in performing a cone search is to use the index to select a box-shaped patch of sky drawn around the circle. Suppose we are searching for a cone specified by (cra, cdec, cradius) then the exscribed box will have coordinate ranges of approximately

ramin = cra - cradius/cos(cdec)

ramax = cra + cradius/cos(cdec)

decmin = cdec - cradius

decmax = cdec + cradius

A slightly more messy bit of trigonometry is needed near the poles, where the scales are more severely distorted, or the areas are larger.

The basic SQL for the search for a table indexed with a B-tree can just specify the range in each coordinate and let the SQL optimiser work out which one can be index-assisted:

   SELECT * FROM table WHERE ra BETWEEN ramin AND ramax AND decl BETWEEN decmin AND decmax ;
whereas for an R-tree assisted table in Postgres, one has to specify the overlap of spatial rectangles:

   SELECT * FROM table WHERE tbox && BOX(POINT(ramin,decmin), POINT(ramax,decmax)) ;
In both cases this selects points within a rectangular box. Some additional trigonometry is needed to select just those points within the circle. The simplest great-circle distance function must not be used, as its numerical accuracy falls dramatically with small angles; a better formula involves haversines. Unfortunately this involves finding the square of a sub-expression, and Standard SQL is remarkably defective in having neither an operator nor a function for finding the square (or any other power) of a numerical value. All actual DBMS remedy this defect, usually with a function called POW or POWER or SQUARE. The SQL below assumes the table has celestial position columns called ra and decl (because dec is a reserved word in SQL), and that we are using Postgres, with an R-tree index on a column called tbox, and that all coordinates are in radians. This simplifies the statement to:

   SELECT * FROM table WHERE tbox && BOX(POINT(ramin,decmin), POINT(ramax,decmax)) AND
ASIN(SQRT(POW(SIN(RADIANS(decl-cdec)/2),2) + COS(RADIANS(cdec)) *
COS(RADIANS(decl)) * POW(SIN(RADIANS(cra-ra)/2),2))) < cradius ;
Since SQL is so poorly standardised, distributing actual SQL around a number of sites is unlikely to be feasible. Clearly what is needed is a standard notation for a cone search, maybe a simple function such as cone(ra, dec, radius) or perhaps with the equinox as a fourth parameter. This could be implemented at each site either by translating this pseudo-SQL to the actual SQL required locally, or by implementing suitable user-defined functions. This is also an obvious way of overcoming the problem that different catalogues will use different names for the RA and declination columns.

Given suitable standards for the query language and for the network interfaces (such as ASU or the NVO's "Simple Cone Search", the distributed cone search should present no difficulties, as far as the input is concerned. The problem of federating the output are more formidable, and discussed in DataFedeReport.

6. Catalogue Cross-matching

6.1 Scientific Requirements

The most basic requirement is that of cross-matching all sources in one catalogue with those of another (for convenience we shall assume that catalogue2 is the larger one). The basic service should do this on the basis of approximate position match, using columns of RA and DEC in each table (assumed using the same equinox, or suitably precessed). The absence of a matching source in catalogue2 is of scientific interest in many cases: this corresponds to an OUTER JOIN in database terms. There are perhaps two rather different practical cases to consider:
  1. The user has a smallish source list which is to be cross-matched with one or more standard catalogues.
  2. The user wants to cross-match one standard catalogue with another. These may be resident on the same server, or not.

6.2 Current Services

Only a few public cross-matching services are currently available, and these have significant limitations. This is a list of all those that I know about, but is probably not exhaustive:
  • The Vizier service of CDS allows a user's own list of sources to be uploaded and cross-matched with any reasonable selection of the source catalogues resident there. The input file format is free-format text; there is a good choice of output formats including HTML and all three types of VOTable. There are limitations on the number of catalogues and the total number of rows output, designed to prevent any one user hogging the resources. A more serious limitation, in my opinion, is that if you upload a list of S sources and cross-match with T tables, you get a total of (S × T) separate tables returned; if you want to merge them you have to do it yourself.
  • The Astrobrowse service of HEASARC has a small collection of standard catalogues which can be cross-matched with HST and CFHT catalogues.
  • The SkyQuery service at JHU allows SDSS results to be cross-matched with 2MASS and/or FIRST catalogues. This is a true distributed join, but is feasible because it only allows sources from a small area of sky to be joined.
  • Several on-line catalogues include the results of some cross-matching exercises carried out during production. A recent example is the 1XMM catalogue giving results from the first year's serendipitous detections in fields of XMM-Newton: the X-ray positions were cross-matched with several well-known standard catalogues. The version of the catalogue from ESA/Vilspa contains the information in one wide FITS table: the Vizier version remotes the cross-match information to a separate table.

6.3 Implementation Options

One way of implementing a spatial join is just to read through table1 sequentially, and for each row do a cone search on table2. This works, but is very slow. A typical cone search service takes a second or more per search, and given the overheads of setting up each search, it is unlikely to get much faster. To get higher performance one can import both tables into a spatially-enabled DBMS and perform a spatial join. I have found that with suitable indexing a spatial join can be done at a rate of over 10,000 rows/second which makes joins on small tables take just seconds; even at this rate joining a table of the size of 2MASS (nearly 500 million rows) would take most of a day.

Work that I carried out last year showed two possible algorithms for the spatial join: the pixel-code algorithm which utilised a mapping of spatial coordinates to numbered pixels, and one based on a true spatial index such as an R-tree. The preliminary results on smallish datasets suggested that the pixel-code method was faster, but there were doubts about the scalability because it required a SELECT DISTINCT stage, which involves sorting the output rows, which does not scale linearly with dataset size. A third possibility which I wanted to investigate was using just a simple one-dimensional index on, say, declination, which works fairly well for the cone search on small tables.

6.4 One-dimensional Indexing

All DBMS seem to be able to build B-trees on floating-point data types (real or double precision) and use them efficiently to handle range queries, like this:

   SELECT * FROM table WHERE ra BETWEEN 123.45 AND 123.47  AND ... ;
This could be the basis of a simple cone search, with tolerable efficiency on small tables. I thought that it might be possible to perform a join on a similar basis. But the first part of the equivalent join condition has to be written something like this:
   SELECT * FROM t1, t2 WHERE ABS(t1.ra - t2.ra) < maxoffset ;
Somewhat to my surprise, even when indices were available for the both declination columns, not a single optimiser was able to make use of them to perform an indexed join. All of them would have gone on happily to do a brute force nested-loop join, but this would have taken some weeks.

Many DBMS support the creating of an index on two or more columns jointly (e.g. on colour and make of a car) so that queries which specify both values can use the combined index to find the relevant row. This method is used by the SDSS team using SQL Server (which has no true spatial index) to speed up their cone searches. Clearly you cannot combine two floating-point columns in one index, but it is possible to produce an index on say FLOOR(DECL) combined with RA. The effect is much the same as that of dividing the sky into declination strips, as in USNO-B accessed by WCSTOOLS. Unfortunately all the DBMS at my disposal also refused to use such combined indices in spatial joins.

It is, of course, possible to reduce both RA and DEC to integers, and produce a combined index on this integer pair. If you just want an exact match condition, the join can indeed be performed at high speed. Unfortunately this does not quite solve the scientific problem: wherever you put the boundaries between one integer value and the next, some sources will have error-circles which overlap two integers. If you want to process these cases properly, you get involved in a much more complex solution, since one source corresponds in general to more than one pixel or integer pair. In fact this reduces to the pixel-code method, in which pixels are simply numbered somehow, and have a standard B-tree index built on them.

6.5 Spatial Indexing in Postgres, MySQL, and DB2

These three DBMS were chosen because they all had a spatial data support built-in, or available as an option. A full report is provided separately, but basically my findings were as follows:
  • Postgres has R-tree indexing built-in. This was well-documented and easy to use. The performance was good. I found no serious problems in using Postgres, although as with all DBMS there are some quirks of which users have to be aware.
  • MySQL has R-tree indexing supported only in the latest Alpha release. It is designed to be compatible, eventually, with the specifications of the OpenGIS consortium. Unfortunately these are not very well aligned with our requirements. I found that loading spatial data was very difficult and slow, and that the spatial indices were much larger than those of Postgres, as the simplest spatial object we can use is a set of five points specifying 4-sided polygon. My tests showed the performance to be much worse than Postgres, and I have subsequently seen a newsgroup report giving similar results. This is not surprising since spatial indexing has only just been added to the DBMS. MySQL has a number of other limitations compared to Postgres, and overall is not favoured.
  • DB2 (from IBM) is a mainframe database now ported to Linux. It has a Spatial Extender option based not on R-trees but on a multi-level grid file. This has to be installed as a separate package, complete with a 572pp manual. Overall I found DB2 quite the most user-hostile piece of software that I have used for many years. Because it took me so many weeks to get data loaded and basic operations working, I did not discover until a late stage that I lacked instructions on installing the necessary licence for the Spatial Extender, and I eventually ran out of time. It is a pity that I was unable to test it, but I suspect that the Spatial Extender would not have been as good as the simple R-tree index of Postgres. The multi-level grid file needs to be matched to the problem domain rather carefully, as it is not self-adjusting to the data range like an R-tree. It also requires a polygon to be defined, with a rectangular box (requiring 5 points) probably the simplest which suits our needs. DB2 has a circle data type, but it is apparently implemented as a polygon of 97 sides, which is unlikely to be processed very efficiently. It is also interesting that Informix (now owned by IBM) supports R-trees, and I wonder if this will be retro-fitted to DB2 if the products merge.

6.6 Pixel-code versus R-tree

Because of the poor performance of MySQL, and the lack of time to get DB2 Spatial Extender working, I was able to compare these two algorithms only using Postgres. The table below shows the time taken to perform the cross matching of a variable number of strips of 360° RA by 0.1° declination of USNO-B with a chunk of the 2MASS catalogue covering 360° in RA by 3.5° in declination, and with the declination ranges overlapping. The times did not include the time needed to ingest the raw data, but did include time to create the necessary indices on the first table, and the join itself. All times are in seconds of elapsed time.

Number of 0.1° strips Pixel-code algorithm R-tree indexing
1 88 secs 54 secs
3 406 secs 161 secs
9 1150 secs 553 secs
27 4623 secs 1964 secs

It is clear that the R-tree method is faster, and that its advantage increases with dataset size. The pixel-code algorithm is more complicated, involving the creation of additional tables, and a final 3-table join. Note that a typical strip of USNO contains 637,000 sources, so that the R-tree method is joining them at a rate of around 10,000 sources/second. This seems a reasonably good speed. It means that anyone arriving with a new table of modest size, such as the 1XMM catalogue with its 50,000 sources, can expect to be able to cross-match it with a standard catalogue in just a few seconds. The time taken should vary only slowly with the size of the second table, as it is accessed via an index,with a time probably proportional to something like log100N where N is the number of rows.

6.7 Distributed Join

We would really like to know whether it is feasible to carry out a distributed join, that is to join two tables which are not installed on the same computer. Large astronomical source catalogues are found on servers all over the world and it is obviously much simpler if they can be used in-situ. This section outlines a number of possible ways forward:
  • Perform the join as a sequence of cone searches: this will obviously work, but it may not be possible to increase the speed much over one row per second, which will be unacceptable in most cases.
  • Send the smaller table across the network by FTP (or gridFTP or whatever) and join it there. This is not quite a distributed join, but is likely to be one of the most efficient solutions. The basic problem is that a join requires most of the rows from table1 (all the data if an outer join is required) and at least a similar number of rows extracted from table2.  The user generally wants most or all of the columns from both tables to be present in the output, so a substantial volume of data, perhaps amounting to the whole of one table, will have to be moved from one system to the other.  If this is done in small chunks under the control of the DBMS it is likely to be much less efficient than if performed in transfers optimised for bulk data flows, such as FTP.
  • Use a DBMS which supports distributed tables, such as Oracle or DB2. I have not had time to try this out using DB2 (it needs yet more optional packages to be installed and licenced) but there seem to be a number of limitations. The facility is really designed for use in a company intra-net, and requires a considerable degree of control over both local and remote DBMS. I can find no information on the IP ports used, so have no idea whether it will work through typical campus firewalls. All that I have read so far suggests that it is an advanced option which needs an experienced database administration team some time to implement. The performance will, obviously, be limited by the network, and it may be that the latency (the round-trip time for messages) is as important as the raw bandwidth.
  • The Distributed Query Project (DQP) of the OGSA-DAI project http://www.ogsadai.org.uk/dqp/ is interesting and worth further investigation.  Potential problems include the query language which is OQL not SQL, the only join supported at present is an equi-join, and it is built upon JDBC which has no specific provision for spatial indexing.
  • A package from INRIA called C-JDBC supports RAID-B (redundant array of independent databases) also performs distributed querying using JDBC. This may also merit further study.  For further details see http://c-jdbc.objectweb.org/current/doc/userGuide/html/index.html

My guess is that it should be feasible to join small tables over the network, but it will not be possible to extend this to large tables.  But we have little idea, yet, what the size limits are.

Let us assume that a distributed join is not generally feasible, and that some sort of data mining centre or data warehouse or astronomical data exploration facility (ADEF) has to be set up. The next few subsections examine the most important cases. 

6.9 Joining a user-generated table with a standard catalogue

Upload: the first step is that of uploading the user's table to some suitable server holding the standard catalogue(s). The main problem here is to choose a suitable file format. Possible choices are:
  • VOTable - the politically correct choice. But so far barely used in the astronomical community and I know of no utilities to create them. Once you have managed to create a VOTable file, the Starlink package TOPCAT could be used to display and edit the contents. Another problem is that the most efficient of the three format options has the data in a separate FITS file: this means there are two files to upload. If VOTable were to be used, then some additional software would be needed to convert it to some format, like CSV, that a DBMS knows how to ingest.
  • FITS - the format currently most used by astronomers for tables, because generated by programs like SEXTRACTOR, and down-loadable from so many archive sites. And there are many tools, such as fv and FTOOLS, to manipulate them. Again if used for upload, the DBMS wrapper would have to be provided with a converter to CSV. For my own tests I produced a prototype which generates an SQL create statement and a CSV file, but this is somewhat DBMS-specific.
  • CSV (character-separated value, or maybe comma-separated value), or TSV (tab ditto) - this is not so much a standard as an ad-hoc "obvious" way of transferring files. Problems include:
    • Which field delimiter does one use: comma, tab, and vertical bar are common choices?
    • The record delimiter is usually a newline, but of course this differs between Unix/Linux and MS-DOS/Windows. Macs used to be different from both, but maybe with their latest O/S based on Unix, this has changed?
    • How are nulls represented: an absence of a value sometimes works, or there may be some specific marker (both MySQL and Postgres accept \N but will allow this to be changed)?
    • Do string fields have to be enclosed in a pair of quotes or apostrophes, or can they be so enclosed if the field itself contains a field separator; if not how do you handle fields which happen to contain field separators?
    • How are date/time values represented?
Unfortunately, nearly all DBMS have no usable binary format for import, so some sort of CSV file is the only realistic option. This loses all metadata, and needs some conversion software yet to be written.  The conversion program can, of course, also create a CREATE TABLE statement, which will ensure the column names and data types are correct and aligned with the data file.  In principle it could also create a metadata table to go alongside the data table, which could keep track of physical units or other information about the substantive table and its columns.  Whether such a system could be used consistently has not yet been explored fully.

Create table, ingest: these stages require the user to have write privileges on the DBMS, which will in general need support by the authentication and authorisation system of AstroGrid. It is then necessary to create a column of spatial data (a box for the R-tree case) and issue an UPDATE statement to populate it from the position and error-circle columns. This also needs write access to the table (but perhaps MySpace can take care of this).


Join:
this operation is then fairly straight-forward on Postgres using R-trees, although there are lots of options that the user has to specify, such as: which columns to be produced in the resulting table, do these included computed quantities such as position offset between the two tables, and is an inner or outer join to be performed.

Export results: A more serious choice has to be confronted over where the results should go, and in what format. Again the obvious possibilities are:

  • Leave them as a new table on the DBMS: this is probably the most useful option, as this allows the user to perform further DBMS operations on the results.
  • Generate plain text results with minimal formatting: this is the default for all DBMS, and may be useful for testing.
  • Generate an HTML table which can be directly displayed on a browser: this is conventional for current on-line DBMS, but not really practical for large tables, or for ones generated non-interactively.
  • VOTable: this needs the output from the DMBS to be processed into this form, and the user needs to choose where the VOTable should be sent, or be told where to pick it up from.
Clearly the MySpace system should be able to handle the last option, but it is not clear to me how or whether it can be involved if the data are left in the grasp of the DBMS. Even if AstroGrid recommends a particular DBMS, we shall have to accept that for the foreseeable future the data mining sites (or whatever we call them) will actually be run by a variety of different DBMS, and they all have their own idiosyncratic ways of authorizing users, ingesting data, and exporting results.

Merge results: if the user wants to cross-match with two or more standard catalogues, we have the problem of merging the output tables into some more digestible form; this is explored in a separate report.

Clean-up: the resources used for the data ingest, the indices, and the joined results presumably need to be reclaimed after some suitable interval. This, in principle, is another job for MySpace.

6.10 Joining one standard catalogue with another on the same DBMS

This is actually a simpler case than the previous one, assuming that both standard catalogues are tables resident in a spatially-aware DBMS such as Postgres, with suitable box columns generated, and indexed. The join operation presents no problems, except that it may take some time. The results will usually be much larger in volume, and the export and clean-up problems are as above. The main problems will probably be administrative, since a much larger allocation of cpu-time and disc space will be necessary.

This is, perhaps, rather an extreme and over-simplified case.  A one-to-one source match is not always what one desires from a cross match. One may well want to list all galaxies located less than 1 degree from the centre of a list of Abell clusters.  In addition one does not always want to cross-match the whole of a catalogue: a more likely case is that of selecting a particular region of sky (say high galactic latitude) or a particular class of object, and then joining that subset with another catalogue.  Whether such a join can be done most efficiently by using the spatial index or not depends on the particular circumstance.  In principle the query optimisers of major DBMS are becoming quite good at working out the best execution plan, but sometimes they need help.  An EXPLAIN command is often provided so that one can work out how the DBMS is going to do an operation in advance.

6.11 Joining one standard catalogue with another located elsewhere

The problems of doing this were explored in section 6.7 above. At present the only option that we know will work is to physically copy the required catalogues to the same server, using some FTP-type service, ingest it, and create the required spatial index. The problem then reduces to that of section 6.8.2.

The majority of standard catalogues are, in principle, available to anyone - that may, indeed, be a definition of a standard catalogue. In practice, though, there may be a number of hurdles to overcome. The physical size of the table is just one of them: network transfers currently start to be difficult beyond about a few hundred gigabytes. In principle, with faster and more reliable networks, it should not be difficult to transfer terabytes of data over the UK and European research networks. Perhaps we can benefit from the work that the particle physics community have put into fast bulk data transfers.

A second major problem is that the astronomical community has not come up with anything resembling a "standard" format for the distribution of catalogue data. In my tests over recent months I have grappled with the following:
  • FITS ASCII tables, with each table covering one small tile on the sky, typically a few degrees across. The HST Guide Star Catalog is distributed on CD-ROM in this format.
  • FITS binary table, with one big file covering the whole sky. The 1XMM catalogue from the XMM-Newton Survey Science Centre is distributed from Vilspa in this format: the table has around 50k rows and 400 columns.
  • Specially-invented binary format, with the sky divided into narrow declination strips, and data sorted into increasing RA within each strip. The USNO catalogues are distributed (on hard disc) in this way, but the format of USNO-A2.0 and USNO-B1.0 is quite different, requiring different software.
  • Compressed CSV files, with the sky divided into narrow declination strips, and data sorted into increasing RA within each strip. The 2MASS catalogue is distributed (on a set of DVDs) in this way.
  • Catalogues can be downloaded from Vizier in a variety of formats: for large tables it is recommended that one uses a plain ASCII fixed-format file, FITS (ascii), or VOTable (with the actual data in FITS).

7. Exploratory Data Analysis

Before data can be mined on a large scale, it is usually necessary to examine its properties, check quality, weed out the rogue data points, and so on. These activities are so common that the acronym EDA is well established in statistical circles.

7.1 Operations feasible with a DBMS

Some of the more basic activities are reasonably well supported by the relational DBMS, for example:
  • Select a subset using an expression on some of the columns, e.g. to select part of a brightness distribution, or rows just at some particular galactic latitude range.
  • Sort the results in ascending or descending order, and maybe present just the top N values.
  • Create and populate a new column, such as hardness-ratio computed from fluxes or magnitudes, or perhaps precess positions to a new equinox.
  • Select a smaller randomly-chosen sample for further work.
  • Compute simple statistics, such as mean, variance or standard deviation.
  • Compute a histogram of values (using the GROUP BY function of SQL).
Some of these are not done very well: for example the simple GROUP BY function will omit from a histogram an ordinate where no points are found to lie, which can be confusing. More advanced statistics, such as finding a median or other quantile, can be done with difficulty in SQL, but efficiency is low compared to custom code. Some of the missing functionality can probably be restored with a few user-defined functions: functions for coordinate conversion, precession, proper motions, and statistical tests are obvious candidates. A more serious problem is that most of the preceding operations require a sequential scan of the entire table. This issue is examined further in the next section.

7.2 Other DBMS Issues

SQL is really not designed for exploratory data analysis, as its design is based on the principle of stating the end result, and letting the DBMS get on with it. The default destination for results is a text representation sent to the user's terminal, to issue a query with the results saved as a new table takes extra work. For EDA what one really wants is a system which automatically saves each result as a new table, and uses that by default for the new query. In case this is not clear, here is a simple example. Suppose we are exploring the values at the extreme of some distribution, say over 3 sigma above the mean, one could approach this stepwise:

  1. Compute the mean and standard deviation
  2. Select a subset of points more than (mean + 5 × sigma).
  3. Sort this into descending order
  4. Display the top 10 values.
The simplest way to do this in SQL is as a single query which involves a sub-query. It cannot be done (as far as I know) in standard SQL, but in the Postgres dialect you would issue a query something like this:
   SELECT h_m, k_m, h_m-k_m AS diff FROM tmass WHERE h_m-k_m >
(SELECT AVG(h_m - k_m) + 3.0 * STDDEV(h_m - k_m) FROM tmass)
ORDER BY h_m - k_m DESC LIMIT 10;
Many DBMS support the creation of a view: a virtual table created by some selection on its parent. Operations on such a view will, however, be no faster than on the parent, as (apart from some benefit from disc caching) the view does not really exist, it is just a convenience to simplify subsequent SQL, so views are not really a satisfactory substitute for what is wanted.

The Starlink package TOPCAT supports a number of useful database-like operations including selection of rows, sorting, and plotting, and it does so in an environment in which each operation can be applied either to the base table or the subset produced in the preceding stage. This is a much better match to the requirements of the exploratory data analyst.  


There are various other considerations that we need to take into account in designing a system, for example:

  • We need to handle self-correlation, i.e. joining a table with itself.  This may be useful for quality control or to find density information, e.g. source clustering, or for other reasons.
  • We need to handle extended objects such as clusters of galaxies.  The pixel-code method starts to get very inefficient whenever the typical object covers more than one pixel; R-tree indexing probably gets less efficient as well, but we have not yet measured how much.
  • We need to handle source-dependent radii: for example "find all QSO within 10 galactic radii of all galaxies".
  • We need to handle much larger spatial offsets than just the small errors in positions, for example another realistic query is "find all cases in which there is a bright stars (suitable as a phase reference for adaptive optics) within N arc-minutes of all sources of class X.
  • Given that typical joins are slow, the user should have the ability to estimate the number of resulting matches, or the time likely to be taken in a query, before actually submitting it.  In some cases the indices or the EXPLAINfacility of the DBMS can be used to estimate this.

7.3 Operations beyond the DBMS

Quite a range of other operations of interest to the astronomer, exploring a new dataset, seem to be rather beyond the capability of the DBMS. This is not an exhaustive list, but some examples include:
  • Regression analysis - cross-correlation of one column with another to find association rules.
  • Deviation detection - finding outliers from some association rule (simple outlier detection from the distribution of values in a single column is obviously possible with SQL, as noted above).
  • Sequence analysis - one is not allowed to assume any specific ordering of rows in a relational database, so that most of the operations of time-series analysis need functionality beyond that of the RDBMS.
  • Clustering and classification analysis: one might be able to find simple associations in space by performing a spatial self-join of a table, but more advanced clustering analysis needs specially-written software.
  • Grapical operations, such as plotting one column against another, projecting positions into an image, and all other visualisation operations.
For all such operations it will obviously be necessary to export the data and ingest into another package.  We have not yet had time to examine any of the possible packages in detail.

8. Performance Enhancements

Many queries will require initially a scan of the whole of a large table in order to select a subset of interest according to some set of criteria: - astronomers will want to run a huge variety of different queries often on functions of values in columns.  A very simple example might be:
SELECT * FROM twomass WHERE (hmag - kmag) > (kmag - mmag) AND abs(Galactic_latitude) > 15.0;

Unless indices have been built on expressions such as (hmag-kmag) or (kmag-mmag) or perhaps on GalacticLatitude then such a query will be done by a sequential search.  An index will be worth-while only if the same sub-expression is likely to be used a number of times.  Building an index requires a sequential scan of the whole table, of course.

It is not too difficult to write software to optimise the reading of all data in a single column (or a few).  My experiments last year, reported in DbmsEvaluations showed that even a FITS table could be read at a speed essentially limited by that of the disc I/O channel.  More recently Patricio Ortiz has been writing an experimental system as part of his Datoz which can read a whole column of USNOB in about 90 seconds with one processor, which is a rate around 45 MB/sec.  If we could do that on an 8-way parallel system, that would be reduced to about 10 secs, perfectly acceptable for interaction, if all I want to do is to select sources with Bmag2 > 14. The DBMS under study performed sequential scans substantially slower than expected from the performance of the disc systems.  These figures are not strictly comparable, as different disc systems were involved, and the DBMS were probably not optimally tuned for such queries, but give some idea of the problem.

Software Package Data transfer rate
DB2 8.6 MB/s
MySQL 11.2 MB/s
Postgres 4.6 MB/s
cFITSIO on row-oriented table 7.2 MB/s
cFITSIO column-oriented table 21.2 MB/s
Datoz (column-oriented with data compression) 42.1 MB/s

It is notable that a standard library accessing a FITS table can achieve about twice that of the best DBMS, while custom code (Datoz) can almost reach the speed of the disc channel (around 50 MB/s).

There are several ways in which we might be able to obtain better performance:

  • Tuning existing DBMS packages better.
  • Distributing data over many nodes in a cluster.
  • Using column-oriented storage.
The major DBMS products such as DB2, Oracle, and Sybase suggest that the best performance can be obtained by allowing the DBMS access to the raw disc partition, as this avoids the overhead of the operating system and file structure.  We were unable to test this with DB2, as it would have caused too much disruption to a system used for other purposes.

With mostly read-only tables, there could be a considerable speed advantage if sequential scan queries were spread over a number of nodes in a cluster. Only a few DBMS seem capable of doing this, among them DB2.  Even here there are some limitations, essentially because these DBMS are designed for handling transactions, which need a lot more locking and synchronisation when performed over a distributed system. Both MySQL and Postgres have some facilities for database replication, essentially in order to increase robustness. It seems possible that we could split our tables arbitrarily between separate instances of, say, Postgres, and send appropriate queries to several nodes in parallel, and combine results. At present, though, there do not seem to be any facilities built-in to make this easy and considerable experimentation would be needed to pursue this further.

It is not too difficult to write software to optimise the reading of all data in a single column (or a few).  My experiments last year, reported in DbmsEvaluations showed that even a FITS table could be read at a speed essentially limited by that of the disc I/O channel.  More recently Patricio Ortiz has been writing an experimental system as part of his Datoz which can read a whole column of USNOB in about 90 seconds with one processor, which is a rate around 45 MB/sec.  If we could do that on an 8-way parallel system, that would be reduced to about 10 secs, perfectly acceptable for interaction, if all I want to do is to select sources with Bmag2 > 14.

When looking at commercial data mining products I came across a couple of products using column-oriented data storage. Normally a DBMS stores data in a row contiguously, as this makes updates and deletions faster, but it means that if you want at least one item from every row, you have to read the entire table from disc. For essentially read-only data, and queries which involve only a subset of the columns in a table, column-oriented storage makes good sense: it means that the only data blocks you have to read from disc are those holding data from the required columns. Since many astronomical tables have 30 to 50 columns, and one has more than 400, this can be a significant gain. The two packages I have discovered so far are: Sybase-IQ and Kdb.

Sybase-IQ is a product of the Sybase company, but entirely separate from their relational DBMS, though it seems to accept the same dialect of SQL. Jim Lewis of IoA Cambridge has had a copy for evaluation, and found it to perform well in the type of queries which do indeed involve just a few columns. Exact benchmarks are not available, because of hardware differences. For some types of exploratory analysis it seems likely that Sybase-IQ would be useful. At present there are two obstacles: firstly it does not yet run on Linux at present (though a port is in planned), the second is the high licencing cost, since the list price is over £30,000 pounds per node. 

Kdb comes from a company called Kx Systemsand all I know about it is what can be seen on their web-site.

Many DBMS support the creation of a view: a virtual table created by some selection on its parent. Operations on such a view will, however, be no faster than on the parent, as (apart from some benefit from disc caching) the view does not really exist, it is just a convenience to simplify subsequent SQL.

9. Conclusions and Future Work

Some conclusions from the work so far are:

  • Spatial joins based on the R-tree index are faster, simpler, and more scalable than those based on the pixel-code algorithm.
  • For spatial joins and searches, the facilities of Postgres are easier to use and much faster than those of MySQL. 
  • The facilities of DB2 may be usable, but it seems to take a great deal longer to get such a mammoth product configured and working.  The lack of any representation for nulls in external files seems a major drawback, and in general it is much less suitable for use by scientists than the other DBMS tested.
The plans for future work in this field are not yet decided.  Watch this space.

Author: Clive Page, with additional contributions from  Patricio Ortiz.

-- ClivePage - 26 Sep 2003

-- ClivePage - 1 Dec 2003 (fixed a few typos).

Topic attachments
I Attachment Action Size Date Who Comment
htmlhtml cross.html manage 46.8 K 2003-10-03 - 13:45 ClivePage Report on Cross Matching Catalogues
Topic revision: r5 - 2003-12-01 - 12:10:00 - ClivePage
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback