CrossMatchingReport
Data Federation and Data Mining
1. Project Objectives
The objectives of the
AstroGrid project were summarised in our
original grant application as follows:
- A working datagrid for key UK databases
- High throughput data mining facilities for interrogating those
databases
- A uniform archive query and data mining software interface
- The ability to browse simultaneously multiple datasets
- A set of tools for integrated on-line analysis of extracted data
- A set of tools for on-line database analysis and exploration
- A facility for users to upload code to run their own algorithms
on the data mining machines
- An exploration of techniques for open-ended resource discovery
The importance attached to "data mining" is clear from the fact that it
appears three times in these eight points. Related operations
such as browsing, exploring, and interrogating databases are also given
prominence. More recently the term "data federation" has come to
the fore. This document tries to explain these terms, and report on
the results of our recent research programme.
2. Data Mining and Knowledge Discovery from Databases
Data mining is generally defined as the discovery of information buried
in massive databases. One can mine images, for example, by
developing more sensitive algorithms to detect previously hidden
sources. These are interesting research projects, but have no
particularly strong e-science component, since they mainly involve the
development of specialist software for use on a single computer.
We are more concerned with mining data from source catalogues, which
are lists of sources of radiation (such as stars and galaxies) detected
in images of the sky. In recent years there has been a explosive growth
in thir size; this started some years ago with the digitisation of
archives of photographic plates using high-resolution scanners, and has
accelerated with the use of ever larger CCD detectors on modern
telescopes. The resulting datasets are tabular in form: the longest
have already reached a billion rows, and the widest tables have around
400 columns. Even larger datasets will be generated by telescopes
coming on stream shortly, especially those from systematic sky surveys
such as WFCAM and VISTA. These source catalogues are held on
servers at data archive sites around the world, mostly making use of
relational database management systems (RDBMS), but some using
astronomer-written software. In principle many of the data mining
functions that we need are provided by standard RDBMS. This
branch of data mining is sometimes called Knowledge Discovery from
Databases (KDD) but this is an awkward phrase (or acronym) and not used
very much in scientific contexts.
The technology of relational database management systems (RDBMS) is
mature and in the commercial world such systems routinely handle huge
datasets. In the 2001 VLDB survey, according to
www.wintercorp.com
the largest commercial databases had 250 billion rows and used 10 TB of
disc. This is one to two orders of magnitude beyond the largest
astronomical tables at present, although we this gap may be closing.
It is notable that the DBMS supporting the three largest transactional
databases listed was IBM's DB2; for decision support a variety of
different products was listed, including Oracle and Informix. The
latter is now also an IBM product, but it competes directly with DB2 so
its future is somewhat uncertain. Academic and research sites in the
UK can get free licences for DB2 under IBM's Scholar's Program. We also
wanted to assess two open source DBMS are widely used in the
astronomical community: MySQL and Postgres. All three products
now support the indexing of spatial data. Besides these general purpose
DBMS, there are several products on the market specifically designed
for data mining in the commercial world. We investigated several
of these briefly, but found that most are designed to mine records from
transactional systems, and have few functions useful in handling
scientific data.
3. Cross-matching and Data Federation
Many useful investigations can be carried out on single catalogues,
although the teams producing each one are likely to have a head start
on such investigations. Results of much greater scientific value are
likely to arise from the combination of two or more catalogues, by
cross-matching them, that is identifying the same sources of radiation
in different catalogues. If the two catalogues cover different
wavebands then you immediately get additional spectral information; if
they cover different epochs then you may get useful information on
source variability, or on proper motion or orbital motion.
In database terms, cross-matching (or cross-identification) involves
a sequence of spatial searches or a spatial join between two (or
more) tables. The spatial join, therefore, represents an essential
basic operation in astronomical KDD. Unfortunately commercial DBMS are
not generally designed to do spatial joins. Some of them provide (or
can have added) a spatial indexing system; an alternative is the
pixel-code (or pcode) algorithm which I described in an earlier report
on SkyIndexing which uses a 2-d to 1-d mapping function and should work
on any common-or-garden RDBMS.
Ideally one would like to perform these spatial joins over a federated
database system, that is by making the tables concerned look as if they
were part of a single unified system, while actually being resident at
different sites. The difficulties of doing this are discussed in more
detail below. An obvious alternative is to set up an astronomical data
exploration facility (ADEF) or astronomical data warehouse: a
repository for all the tables which are needed in a given data mining
investigation.
4. DBMS Investigations
The work carried out during the period 2003 July to September was to
study two main questions:
- Which of the two algorithms for carrying out spatial joins works
best on large datasets
- join based on a spatial index such as an R-tree
- the pixel-code algorithm.
- Which DBMS performs best on real astronomical data. The
short-list from our earlier investigations was:
- Postgres: an open-source object-relational DBMS with
spatial indexing based on R-trees and a reputation for good scalability
and robustness.
- MySQL: a light-weight open-source relational
DBMS, with a reputation for fast querying. R-tree indexing has been
introduced in the latest alpha-release.
- DB2 from IBM: an "enterprise level" commercial
object-relational DBMS, which can be used free for academic and
research purposes. It supports spatial indexing but does this with a
multi-stage gridfile, not an R-tree. DB2 supports advanced facilities
include federated and distributed databases, and can utilise parallel
hardware.
At present MySQL is probably the DBMS most widely used by
astronomers, but Postgres appears to be gaining in popularity,
especially by those handling large datasets. Although DB2 has a large
commercial base, I do not know of any use for astronomical data.
It is important to note that while the DBMS is likely to be an
important element in a data mining system, other packages will also be
required. Factors of interest therefore include:
- Ease of importing data -
because all DBMS have their own (often unpublished) internal formats,
and none of them can handle tabular data formats used by astronomers
such as FITS tables or VOTable, so some data conversions are inevitable.
- Ease of exporting data -
to statistical or visualisation packages, since DBMS have a limited
repertoire of statistical functions, and not even basic graphical
facilities.
- Quality of documentation and software support
- Ease of installation,
configuration and use of the package are all important, as few
data centres have teams of database administrators or experts on hand.
- Conformance to existing standards
- Performance not only on
spatial searches and joins, but also on operations involving data
loading and sequential scanning, since not all ad-hoc queries will be
able to take advantage of indices.
It should be noted that
AstroGrid's remit is essentially that of
providing
infrastructure for data mining and the handling of
large datasets. This has been interpreted to mean software rather than
hardware or network links, which are the responsibility of other
bodies. We expect that the various astronomical data archive centres
will continue to make their datasets available, and that many of them
will also chose to provide data mining facilities, at least for their
local users, and perhaps open to everyone. The main role of
AstroGrid,
therefore, is to recommend the software packages that such data centres
should install, and to develop additional software where necessary.
Such software will be freely available. I expect that additional data
mining centres will be set up to support particular groups of users;
AstroGrid will also want to set up one or more data mining centres in
order to test and demonstrate its software.
5. Cone Search
5.1 Scientific Requirements
The term "cone search" is a short-hand term for the searching one or
more source catalogues for sources near a given celestial position,
Measured positions always have errors, usually but not always about the
same in both coordinates, so one needs to specify not only the position
of interest (RA, DEC) but also a small search RADIUS. This effectively
searches a cone in space extending from the observer out to the edge of
the observable universe, or infinity, whichever comes first. There is
one minor complication: the most commonly used astronomical coordinate
system is locked to the Earth's orbit, which precesses slowly, so a
complete positional specification should include the coordinate system
equinox: common choices are B1950 or J2000. Many services allow the
user to choose the desired equinox, and do the relevant coordinate
conversions on the fly. Another valuable feature is that some services
allow you to search a small radius around a named source, and will use
an astronomical name resolver such as SIMBAD or NED to translate the
name to a pair of celestial coordinates.
The principal criterion by which two source records can be matched
is the approximate position, but it is not always sufficient. For
example, in the case of X-ray source positions which have fairly large
error-circles and usually rather faint optical counterparts, one may
find several optical sources within the error-circle. Other parameters,
typically spectral information, are then needed to resolve the
ambiguity. This may be an iterative process: after a reasonable number
of sources has been successfully matched, one gets an idea of the
properties of a particular class of sources, perhaps making it easier
to resolve ambiguities in the remainder.
Some accurate positional work might need to take into account proper
motions of stars in space: a number of catalogues, including USNO-B,
contain proper motion data on many of the nearby stars, so that by
specifying the date of interest, one can precess the catalogue
positions to the required epoch. Vizier is able to do this, when the
information is available in the right form.
The cone search is just about the simplest operation that we need to
support, indeed at first sight it may appear trivial, but some of the
issues are worth exploring as they become more acute in case of
cross-matching lists of sources.
5.2 Existing Services
The SIMBAD service of CDS was established over 20 years ago but
supported from the outset a form of cone search for bibliographical
information. With the arrival of the Internet, many observational
archives came on-line, usually allowing one to log in to a captive
account via telnet and perform cone searches on their data holdings,
often including relevant source catalogues. Early examples included
LEDAS at Leicester and HEASARC at NASA/GSFC. As soon as the world wide
web was invented, these services and others quickly adopted interfaces
driven by HTML forms, with CGI connecting to a back-end running a
database management system of some sort. These services all required
the same three primary parameters for a cone search: RA, DEC, and
RADIUS, but there were many differences of detail. Angles might have to
be given in decimal degrees or sexagesimal form, the delimiters might
be spaces, commas, or colons, and the radius might have units of
degrees, arc-minutes, or arc-seconds.
The first attempt to bring order to the chaos was the Astronomical
Server URL specification (ASU) published by CDS in 1966 - for details
look at http://vizier.u-strasbg.fr/doc/asu.html.
CDS also devised the GLU software package which could be used to
translate a standard cone-search query into the syntax, coordinates,
and equinox required by each of a number of well-known data archive
centres. This made it possible to send a single query to a number of
servers: the Astrobrowse facility of HEASARC and GSFC is the best-known
service making use of this ad-hoc standard, which many services have
now adopted. Using Astrobrowse, one can issue one cone search query,
and wait for responses from a dozen or more archive sites, which are
queried in parallel.
More recently the NVO project has released a definition for a "Simple
Cone Search", to which a number of US-based archive sites now conform.
This specifies a Web Services interface, with data returned as a
MIME-encapsulated VOTable. A description and registry can be found at
http://us-vo.org/metadata/conesearch/.
Unfortunately, and for reasons which remain obscure, the main query
parameters are incompatible with the existing ASU specification, and it
omits the equinox parameter, which forces everyone to assume J2000.
It is also worth noting the Vizier service of CDS: this contains
something like 10,000 tables, many of them source catalogues. A single
cone search can be specified to be performed on one or many of these
catalogues.
A major limitation of all current services is that there is no
attempt to merge or combine the responses which appear as separate
tables. This issue is covered in a separate report DataFedeReport .
5.3 Implementation
For small tables efficiency is not important, since a response within a
few seconds is probably adequate, so a brute-force search of the entire
table is a feasible implementation. For tables of more than a few
thousand rows, however, some index assistance is desirable. Since one
is doing a spatial search, a spatial or two-dimensional index is
desirable, but support for a spatial index is still uncommon in
mainstream DBMS. Many services, however, use no professional DBMS at
all but rely on astronomer-written software. Perhaps the most
venerable program still for this purpose in use is the BROWSE program
written around 1980 in support of ESA's database of results from the
EXOSAT observatory. It is still in use at HEASARC and LEDAS and maybe
other sites, but likely to be retired fairly soon.
WCSTOOLS, a package from the Harvard-Smithsonian Center for
Astrophysics, is also used by many archive sites. This is a
wonderfully pragmatic piece of software which supports access to a
number of well-known source catalogues in their original distribution
format (or at least an uncompressed version of this): supported formats
include those of USNO, GSC-II, 2MASS, Tycho, ACT, SAO, IRAS, and PPM
catalogues. Essentially the WCSTOOLS scat program has a back-end
matching the schema and structure of each of these, as well as its own
(even more gloriously ad-hoc) formats for textual and binary data. The
only accelerated access method used in WCSTOOLS is slicing and sorting.
For example 2MASS is sliced into separate files each covering a strip of
declination 1.67° high, and USNO-B1.0 is sliced into strips
0.1° high. Within each strip the rows are sorted by Right
Ascension. This simple structure permits a surprisingly rapid access.
Although sorting works well for small tables, it does not scale very
well. Access to a given row in a table of N rows takes typically log2N
disc seeks, whereas using a B-tree holding say 100 rows per bucket will
only require log100N seeks: for a table of a billion rows
this the B-tree is likely to be around six times as fast. Sorting is
only applicable to read-only tables, perhaps why no professional DBMS
supports it, and of course only works in one dimension.
MySQL, Sybase, and SQL Server appear to be the most popular
DBMS used for astronomical data archives, but as far as I know all
sites using them index their source catalogues along one coordinate
axis only. An index on RA is generally more efficient, as sources in
the sky will have a more uniform distribution in RA than in DEC. On
the other hand the wrap-around of values at the zero-point of the RA
scale is a serious complication. Many services have therefore chosen
to use their single index along the declination axis. This is
obviously not very efficient, as the index is effectively selecting not
a small patch of sky but a strip around the entire 360° strip of RA.
The effect in practice is shown by the following table of results,
obtained on the whole 2MASS catalogue of 470,992,970 rows and 60
columns, ingested into Postgres. Cone searches covering an area of
0.01 and 2.0 square degrees were carried out, when the table was
accessed by either a B-tree on declination, or an R-tree on both RA and
declination. The elapsed times are in seconds.
The superiority of the R-tree is overwhelming on a table of this
size. The R-tree access times reduce further, to a millisecond or so,
when the same (or a similar) query is repeated, as the required index
blocks are cached in memory, whereas repeated B-tree accesses require
scanning rather a substantial fraction of the original dataset, and
these hardly speed up at all.
A spatial index is not only beneficial in improving performance,
but also provides a powerful tool which would allow astronomers to
select sources according to the projected sky density in the area of
interest. This may be necessary to find the probability of false-alarm
in identifying sources from one table to another. It also allows one to
answer questions like: "give me a list of quasars located in areas
where their density is larger than 15 objects per square degree".
5.4 Query Language
All modern DBMS speak nothing but SQL, in theory a standard query
language, but in practice a collection of dialects. The SQL needed for
a cone search also varies depending on whether a particular DBMS uses
simple B-tree indexing, or an R-tree. In the latter case there are no
widely accepted standards for the notation. In both cases the first
step in performing a cone search is to use the index to select a
box-shaped patch of sky drawn around the circle. Suppose we are
searching for a cone specified by (cra, cdec, cradius) then the
exscribed box will have coordinate ranges of approximately
ramin = cra - cradius/cos(cdec)
ramax = cra + cradius/cos(cdec)
decmin = cdec - cradius
decmax = cdec + cradius
A slightly more messy bit of trigonometry is needed near the poles,
where the scales are more severely distorted, or the areas are larger.
The basic SQL for the search for a table indexed with a B-tree can
just specify the range in each coordinate and let the SQL optimiser
work out which one can be index-assisted:
SELECT * FROM table WHERE ra BETWEEN ramin AND ramax AND decl BETWEEN decmin AND decmax ;
whereas for an R-tree assisted table in Postgres, one has to specify
the overlap of spatial rectangles:
SELECT * FROM table WHERE tbox && BOX(POINT(ramin,decmin), POINT(ramax,decmax)) ;
In both cases this selects points within a rectangular box. Some
additional trigonometry is needed to select just those points within
the circle. The simplest great-circle distance function must not be
used, as its numerical accuracy falls dramatically with small angles; a
better formula involves haversines. Unfortunately this involves
finding the square of a sub-expression, and Standard SQL is remarkably
defective in having neither an operator nor a function for finding the
square (or any other power) of a numerical value. All actual DBMS
remedy this defect, usually with a function called POW or POWER or
SQUARE. The SQL below assumes the table has celestial position columns
called
ra and
decl (because
dec is a reserved word in SQL),
and that we are using Postgres, with an R-tree index on a column called
tbox,
and that all coordinates
are in radians. This simplifies the statement to:
SELECT * FROM table WHERE tbox && BOX(POINT(ramin,decmin), POINT(ramax,decmax)) AND
ASIN(SQRT(POW(SIN(RADIANS(decl-cdec)/2),2) + COS(RADIANS(cdec)) *
COS(RADIANS(decl)) * POW(SIN(RADIANS(cra-ra)/2),2))) < cradius ;
Since SQL is so poorly standardised, distributing actual SQL around a
number of sites is unlikely to be feasible. Clearly what is needed is
a standard notation for a cone search, maybe a simple function such as
cone(ra, dec, radius) or
perhaps with the equinox as a fourth parameter. This could be
implemented at each site either by translating this pseudo-SQL to the
actual SQL required locally, or by implementing suitable user-defined
functions. This is also an obvious way of overcoming the problem that
different catalogues will use different names for the RA and
declination columns.
Given suitable standards for the query language and for the network
interfaces (such as ASU or the NVO's "Simple Cone Search", the
distributed cone search should present no difficulties, as far as the
input is concerned. The problem of federating the output are more
formidable, and discussed in
DataFedeReport.
6. Catalogue Cross-matching
6.1 Scientific Requirements
The most basic requirement is that of cross-matching all sources in one
catalogue with those of another (for convenience we shall assume that
catalogue2 is the larger one). The basic service should do this on the
basis of approximate position match, using columns of RA and DEC in
each table (assumed using the same equinox, or suitably precessed).
The absence of a matching source in catalogue2 is of scientific
interest in many cases: this corresponds to an OUTER JOIN in database
terms. There are perhaps two rather different practical cases to
consider:
- The user has a smallish source list which is to be cross-matched
with one or more standard catalogues.
- The user wants to cross-match one standard catalogue with
another. These may be resident on the same server, or not.
6.2 Current Services
Only a few public cross-matching services are currently available, and
these have significant limitations. This is a list of all those that I
know about, but is probably not exhaustive:
- The Vizier service of CDS allows a user's own list of sources to
be uploaded and cross-matched with any reasonable selection of the
source catalogues resident there. The input file format is free-format
text; there is a good choice of output formats including HTML and all
three types of VOTable. There are limitations on the number of
catalogues and the total number of rows output, designed to prevent any
one user hogging the resources. A more serious limitation, in my
opinion, is that if you upload a list of S sources and cross-match with
T tables, you get a total of (S × T) separate tables returned; if
you want to merge them you have to do it yourself.
- The Astrobrowse service of HEASARC has a small collection of
standard catalogues which can be cross-matched with HST and CFHT
catalogues.
- The SkyQuery service at JHU allows SDSS results to be
cross-matched with 2MASS and/or FIRST catalogues. This is a true
distributed join, but is feasible because it only allows sources from a
small area of sky to be joined.
- Several on-line catalogues include the results of some
cross-matching exercises carried out during production. A recent
example is the 1XMM catalogue giving results from the first year's
serendipitous detections in fields of XMM-Newton: the X-ray positions
were cross-matched with several well-known standard catalogues. The
version of the catalogue from ESA/Vilspa contains the information in
one wide FITS table: the Vizier version remotes the cross-match
information to a separate table.
6.3 Implementation Options
One way of implementing a spatial join is just to read through table1
sequentially, and for each row do a cone search on table2. This works,
but is very slow. A typical cone search service takes a second or more
per search, and given the overheads of setting up each search, it is
unlikely to get much faster. To get higher performance one can import
both tables into a spatially-enabled DBMS and perform a spatial join.
I have found that with suitable indexing a spatial join can be done at
a rate of over 10,000 rows/second which makes joins on small tables
take just seconds; even at this rate joining a table of the size of
2MASS (nearly 500 million rows) would take most of a day.
Work that I carried out last year showed two possible algorithms for
the spatial join: the pixel-code algorithm which utilised a mapping of
spatial coordinates to numbered pixels, and one based on a true spatial
index such as an R-tree. The preliminary results on smallish datasets
suggested that the pixel-code method was faster, but there were doubts
about the scalability because it required a SELECT DISTINCT stage,
which involves sorting the output rows, which does not scale linearly
with dataset size. A third possibility which I wanted to investigate
was using just a simple one-dimensional index on, say, declination,
which works fairly well for the cone search on small tables.
6.4 One-dimensional Indexing
All DBMS seem to be able to
build B-trees on floating-point data types (real or double precision)
and use them efficiently to handle range queries, like this:
SELECT * FROM table WHERE ra BETWEEN 123.45 AND 123.47 AND ... ;
This could be the basis of a simple cone search, with tolerable
efficiency on small tables. I thought that it might be possible to
perform a join on a similar basis. But the first part of the
equivalent join condition has to be written something like this:
SELECT * FROM t1, t2 WHERE ABS(t1.ra - t2.ra) < maxoffset ;
Somewhat to my surprise, even when indices were available for the both
declination columns, not a single optimiser was able to make use of
them to perform an indexed join. All of them would have gone on
happily to do a brute force nested-loop join, but this would have taken
some weeks.
Many DBMS support the creating of an index on two or more columns
jointly (e.g. on colour and make of a car) so that queries which
specify both values can use the combined index to find the relevant
row. This method is used by the SDSS team using SQL Server (which has
no true spatial index) to speed up their cone searches. Clearly you
cannot combine two floating-point columns in one index, but it is
possible to produce an index on say FLOOR(DECL) combined with RA. The
effect is much the same as that of dividing the sky into declination
strips, as in USNO-B accessed by WCSTOOLS. Unfortunately all the DBMS at
my disposal also refused to use such combined indices in spatial joins.
It is, of course, possible to reduce both RA and DEC to integers, and
produce a combined index on this integer pair. If you just want an
exact match condition, the join can indeed be performed at high speed.
Unfortunately this does not quite solve the scientific problem:
wherever you put the boundaries between one integer value and the next,
some sources will have error-circles which overlap two integers. If
you want to process these cases properly, you get involved in a much
more complex solution, since one source corresponds in general to more
than one pixel or integer pair. In fact this reduces to the pixel-code
method, in which pixels are simply numbered somehow, and have a
standard B-tree index built on them.
6.5 Spatial Indexing in Postgres, MySQL, and DB2
These three DBMS were chosen
because they all had a spatial data support built-in, or available as
an option. A full report is provided separately, but basically my
findings were as follows:
- Postgres has R-tree
indexing built-in. This was well-documented and easy to use. The
performance was good. I found no serious problems in using Postgres,
although as with all DBMS there are some quirks of which users have to
be aware.
- MySQL has
R-tree indexing supported only in the latest Alpha release. It is
designed to be compatible, eventually, with the specifications of the
OpenGIS consortium. Unfortunately these are not very well aligned with
our requirements. I found that loading spatial data was very difficult
and slow, and that the spatial indices were much larger than those of
Postgres, as the simplest spatial object we can use is a set of five
points specifying 4-sided polygon. My tests showed the performance to
be much worse than Postgres, and I have subsequently seen a newsgroup
report giving similar results. This is not surprising since spatial
indexing has only just been added to the DBMS. MySQL has a
number of other limitations compared to Postgres, and overall is not
favoured.
- DB2 (from IBM) is a
mainframe database now ported to Linux. It has a Spatial Extender
option based not on R-trees but on a multi-level grid file. This has
to be installed as a separate package, complete with a 572pp manual.
Overall I found DB2 quite the most user-hostile piece of software that
I have used for many years. Because it took me so many weeks to get
data loaded and basic operations working, I did not discover until a
late stage that I lacked instructions on installing the necessary
licence for the Spatial Extender, and I eventually ran out of time. It
is a pity that I was unable to test it, but I suspect that the Spatial
Extender would not have been as good as the simple R-tree index of
Postgres. The multi-level grid file needs to be matched to the problem
domain rather carefully, as it is not self-adjusting to the data range
like an R-tree. It also requires a polygon to be defined, with a
rectangular box (requiring 5 points) probably the simplest which suits
our needs. DB2 has a circle data type, but it is apparently
implemented as a polygon of 97 sides, which is unlikely to be processed
very efficiently. It is also interesting that Informix (now owned by
IBM) supports R-trees, and I wonder if this will be retro-fitted to DB2
if the products merge.
6.6 Pixel-code versus R-tree
Because of the poor performance
of MySQL, and the lack of time to get DB2 Spatial Extender
working, I was able to compare these two algorithms only using
Postgres. The table below shows the time taken to perform the cross
matching of a variable number of strips of 360° RA by 0.1°
declination of USNO-B with a chunk of the 2MASS catalogue covering
360° in RA by 3.5° in declination, and with the declination
ranges overlapping. The times did not include the time needed to
ingest the raw data, but did include time to create the necessary
indices on the first table, and the join itself. All times are in
seconds of elapsed time.
It is clear that the R-tree method is faster, and that its advantage
increases with dataset size. The pixel-code algorithm is more
complicated, involving the creation of additional tables, and a final
3-table join. Note that a typical strip of USNO contains 637,000
sources, so that the R-tree method is joining them at a rate of around
10,000 sources/second. This seems a reasonably good speed. It means
that anyone arriving with a new table of modest size, such as the 1XMM
catalogue with its 50,000 sources, can expect to be able to cross-match
it with a standard catalogue in just a few seconds. The time taken
should vary only slowly with the size of the second table, as it is
accessed via an index,with a time probably proportional to something
like log
100N where N is the number of rows.
6.7 Distributed Join
We would really like to know whether it is feasible to carry out a
distributed join, that is to join two tables which are not installed on
the same computer. Large astronomical source catalogues are found on
servers all over the world and it is obviously much simpler if they can
be used in-situ. This section outlines a number of possible ways
forward:
- Perform the join as a sequence of cone searches: this will
obviously work, but it may not be possible to increase the speed much
over one row per second, which will be unacceptable in most cases.
- Send the smaller table across the network by FTP (or gridFTP or
whatever) and join it there. This is not quite a distributed join, but
is likely to be one of the most efficient solutions. The basic problem
is that a join requires most of the rows from table1 (all the data if
an outer join is required) and at least a similar number of rows
extracted from table2. The user generally wants most or all of
the columns from both tables to be present in the output, so a
substantial volume of data, perhaps amounting to the whole of one
table, will have to be moved from one system to the other. If
this is done in small chunks under the control of the DBMS it is likely
to be much less efficient than if performed in transfers optimised for
bulk data flows, such as FTP.
- Use a DBMS which supports distributed tables, such as Oracle or
DB2. I have not had time to try this out using DB2 (it needs yet more
optional packages to be installed and licenced) but there seem to be a
number of limitations. The facility is really designed for use in a
company intra-net, and requires a considerable degree of control over
both local and remote DBMS. I can find no information on the IP ports
used, so have no idea whether it will work through typical campus
firewalls. All that I have read so far suggests that it is an advanced
option which needs an experienced database administration team some
time to implement. The performance will, obviously, be limited by the
network, and it may be that the latency (the round-trip time for
messages) is as important as the raw bandwidth.
- The Distributed Query Project (DQP) of the OGSA-DAI project http://www.ogsadai.org.uk/dqp/
is interesting and worth further investigation. Potential problems
include the query language which is OQL not SQL, the only join supported
at present is an equi-join, and it is built upon JDBC which has no
specific provision for spatial indexing.
- A package from INRIA called C-JDBC supports RAID-B (redundant
array of independent databases) also performs distributed querying
using JDBC. This may also merit further study. For further
details see http://c-jdbc.objectweb.org/current/doc/userGuide/html/index.html
My guess is that it should be feasible to join small tables over the
network, but it will not be possible to extend this to large
tables. But we have little idea, yet, what the size limits are.
Let us assume that a distributed join is not generally feasible, and
that some sort of data mining centre or data warehouse or astronomical
data exploration facility (ADEF) has to be set up. The next few
subsections examine the most important cases.
6.9 Joining a user-generated table with a standard catalogue
Upload: the first step is that
of uploading the user's table to some suitable server holding the
standard catalogue(s). The main problem here is to choose a suitable
file format. Possible choices are:
- VOTable - the politically correct choice. But so far barely used
in the astronomical community and I know of no utilities to create
them. Once you have managed to create a VOTable file, the Starlink
package TOPCAT could be used to display and edit the contents. Another
problem is that the most efficient of the three format options has the
data in a separate FITS file: this means there are two files to upload.
If VOTable were to be used, then some additional software would be
needed to convert it to some format, like CSV, that a DBMS knows how to
ingest.
- FITS - the format currently most used by astronomers for tables,
because generated by programs like SEXTRACTOR, and down-loadable from
so many archive sites. And there are many tools, such as fv and
FTOOLS, to manipulate them. Again if used for upload, the DBMS wrapper
would have to be provided with a converter to CSV. For my own tests I
produced a prototype which generates an SQL create statement and a CSV
file, but this is somewhat DBMS-specific.
- CSV (character-separated value, or maybe comma-separated value),
or TSV (tab ditto) - this is not so much a standard as an ad-hoc
"obvious" way of transferring files. Problems include:
- Which field delimiter does one use: comma, tab, and vertical
bar are common choices?
- The record delimiter is usually a newline, but of course this
differs between Unix/Linux and MS-DOS/Windows. Macs used to be
different from both, but maybe with their latest O/S based on Unix,
this has changed?
- How are nulls represented: an absence of a value sometimes
works, or there may be some specific marker (both MySQL and
Postgres accept \N but
will allow this to be changed)?
- Do string fields have to be enclosed in a pair of quotes or
apostrophes, or can they be so enclosed if the field itself contains a
field separator; if not how do you handle fields which happen to
contain field separators?
- How are date/time values represented?
Unfortunately, nearly all DBMS have no usable binary format for import,
so some sort of CSV file is the only realistic option. This loses all
metadata, and needs some conversion software yet to be written.
The conversion program can, of course, also create a
CREATE TABLE statement, which
will ensure the column names and data types are correct and aligned
with the data file. In principle it could also create a metadata
table to go alongside the data table, which could keep track of
physical units or other information about the substantive table and its
columns. Whether such a system could be used consistently has not
yet been explored fully.
Create table, ingest: these
stages require the user to have write privileges on the DBMS, which
will in general need support by the authentication and authorisation
system of
AstroGrid. It is then necessary to create a column of
spatial data (a box for the R-tree case) and issue an UPDATE statement
to populate it from the position and error-circle columns. This also
needs write access to the table (but perhaps
MySpace can take care of
this).
Join: this operation is then fairly straight-forward on Postgres
using R-trees, although there are lots of options that the user has to
specify, such as: which columns to be produced in the resulting table,
do these included computed quantities such as position offset between
the two tables, and is an inner or outer join to be performed.
Export results: A more serious
choice has to be confronted over where the results should go, and in
what format. Again the obvious possibilities are:
- Leave them as a new table on the DBMS: this is probably the most
useful option, as this allows the user to perform further DBMS
operations on the results.
- Generate plain text results with minimal formatting: this is the
default for all DBMS, and may be useful for testing.
- Generate an HTML table which can be directly displayed on a
browser: this is conventional for current on-line DBMS, but not really
practical for large tables, or for ones generated non-interactively.
- VOTable: this needs the output from the DMBS to be processed into
this form, and the user needs to choose where the VOTable should be
sent, or be told where to pick it up from.
Clearly the
MySpace system should be able to handle the last option,
but it is not clear to me how or whether it can be involved if the data
are left in the grasp of the DBMS. Even if
AstroGrid recommends a
particular DBMS, we shall have to accept that for the foreseeable
future the data mining sites (or whatever we call them) will actually
be run by a variety of different DBMS, and they all have their own
idiosyncratic ways of authorizing users, ingesting data, and exporting
results.
Merge results: if the user wants
to cross-match with two or more standard catalogues, we have the
problem of merging the output tables into some more digestible form;
this is explored in a separate report.
Clean-up: the resources used for
the data ingest, the indices, and the joined results presumably need to
be reclaimed after some suitable interval. This, in principle, is
another job for
MySpace.
6.10 Joining one standard catalogue with another on the same DBMS
This is actually a simpler case than the previous one, assuming that
both standard catalogues are tables resident in a spatially-aware DBMS
such as Postgres, with suitable box columns generated, and indexed.
The join operation presents no problems, except that it may take some
time. The results will usually be much larger in volume, and the export
and clean-up problems are as above. The main problems will probably be
administrative, since a much larger allocation of cpu-time and disc
space will be necessary.
This is, perhaps, rather an extreme and over-simplified case.
A one-to-one source match is not always what one desires from a cross
match. One may well want to list all galaxies located less than 1
degree from the centre of a list of Abell clusters. In addition
one does not always want to cross-match the whole of a catalogue: a
more likely case is that of selecting a particular region of sky (say
high galactic latitude) or a particular class of object, and then
joining that subset with another catalogue. Whether such a join
can be done most efficiently by using the spatial index or not depends
on the particular circumstance. In principle the query optimisers
of major DBMS are becoming quite good at working out the best execution
plan, but sometimes they need help. An EXPLAIN command is often
provided so that one can work out how the DBMS is going to do an
operation in advance.
6.11 Joining one standard catalogue with another located elsewhere
The problems of doing this were explored in section 6.7 above. At
present the only option that we know will work is to physically copy
the required catalogues to the same server, using some FTP-type
service, ingest it, and create the required spatial index. The problem
then reduces to that of section 6.8.2.
The majority of standard catalogues are, in principle, available to
anyone - that may, indeed, be a definition of a standard catalogue. In
practice, though, there may be a number of hurdles to overcome. The
physical size of the table is just one of them: network transfers
currently start to be difficult beyond about a few hundred gigabytes.
In principle, with faster and more reliable networks, it should not be
difficult to transfer terabytes of data over the UK and European
research networks. Perhaps we can benefit from the work that the
particle physics community have put into fast bulk data transfers.
A second major problem is that the astronomical community has not come
up with anything resembling a "standard" format for the distribution of
catalogue data. In my tests over recent months I have grappled with
the following:
- FITS ASCII tables, with each table covering one small tile on the
sky, typically a few degrees across. The HST Guide Star Catalog is
distributed on CD-ROM in this format.
- FITS binary table, with one big file covering the whole sky. The
1XMM catalogue from the XMM-Newton Survey Science Centre is distributed
from Vilspa in this format: the table has around 50k rows and 400
columns.
- Specially-invented binary format, with the sky divided into
narrow declination strips, and data sorted into increasing RA within
each strip. The USNO catalogues are distributed (on hard disc) in this
way, but the format of USNO-A2.0 and USNO-B1.0 is quite different,
requiring different software.
- Compressed CSV files, with the sky divided into narrow
declination strips, and data sorted into increasing RA within each
strip. The 2MASS catalogue is distributed (on a set of DVDs) in this
way.
- Catalogues can be downloaded from Vizier in a variety of formats:
for large tables it is recommended that one uses a plain ASCII
fixed-format file, FITS (ascii), or VOTable (with the actual data in
FITS).
7. Exploratory Data Analysis
Before data can be mined on a large scale, it is usually necessary to
examine its properties, check quality, weed out the rogue data points,
and so on. These activities are so common that the acronym EDA is well
established in statistical circles.
7.1 Operations feasible with a DBMS
Some of the more basic activities are reasonably well supported by the
relational DBMS, for example:
- Select a subset using an expression on some of the columns, e.g.
to select part of a brightness distribution, or rows just at some
particular galactic latitude range.
- Sort the results in ascending or descending order, and maybe
present just the top N values.
- Create and populate a new column, such as hardness-ratio computed
from fluxes or magnitudes, or perhaps precess positions to a new
equinox.
- Select a smaller randomly-chosen sample for further work.
- Compute simple statistics, such as mean, variance or standard
deviation.
- Compute a histogram of values (using the GROUP BY function of
SQL).
Some of these are not done very well: for example the simple GROUP BY
function will omit from a histogram an ordinate where no points are
found to lie, which can be confusing. More advanced statistics, such
as finding a median or other quantile, can be done with difficulty in
SQL, but efficiency is low compared to custom code. Some of the
missing functionality can probably be restored with a few user-defined
functions: functions for coordinate conversion, precession, proper
motions, and statistical tests are obvious candidates. A more serious
problem is that most of the preceding operations require a sequential
scan of the entire table. This issue is examined further in the next
section.
7.2 Other DBMS Issues
SQL is really not designed for exploratory data analysis, as its
design is based on the principle of stating the end result, and letting
the DBMS get on with it. The default destination for results is a text
representation sent to the user's terminal, to issue a query with the
results saved as a new table takes extra work. For EDA what one really
wants is a system which automatically saves each result as a new table,
and uses that by default for the new query. In case this is not clear,
here is a simple example. Suppose we are exploring the values at the
extreme of some distribution, say over 3 sigma above the mean, one
could approach this stepwise:
- Compute the mean and standard deviation
- Select a subset of points more than (mean + 5 × sigma).
- Sort this into descending order
- Display the top 10 values.
The simplest way to do this in SQL is as a single query which involves
a sub-query. It cannot be done (as far as I know) in standard SQL, but
in the Postgres dialect you would issue a query something like this:
SELECT h_m, k_m, h_m-k_m AS diff FROM tmass WHERE h_m-k_m >
(SELECT AVG(h_m - k_m) + 3.0 * STDDEV(h_m - k_m) FROM tmass)
ORDER BY h_m - k_m DESC LIMIT 10;
Many DBMS support the creation of a
view:
a virtual table created by some selection on its parent. Operations on
such a view will, however, be no faster than on the parent, as (apart
from some benefit from disc caching) the view does not really exist, it
is just a convenience to simplify subsequent SQL, so views are not
really a satisfactory substitute for what is wanted.
The Starlink package TOPCAT supports a number of useful database-like
operations including selection of rows, sorting, and plotting, and it
does so in an environment in which each operation can be applied either
to the base table or the subset produced in the preceding stage. This
is a much better match to the requirements of the exploratory data
analyst.
There are various other considerations that we need to take into
account in designing a system, for example:
- We need to handle self-correlation, i.e. joining a table with
itself. This may be useful for quality control or to find density
information, e.g. source clustering, or for other reasons.
- We need to handle extended objects such as clusters of
galaxies. The pixel-code method starts to get very inefficient
whenever the typical object covers more than one pixel; R-tree indexing
probably gets less efficient as well, but we have not yet measured how
much.
- We need to handle source-dependent radii: for example "find all
QSO within 10 galactic radii of all galaxies".
- We need to handle much larger spatial offsets than just the small
errors in positions, for example another realistic query is "find all
cases in which there is a bright stars (suitable as a phase reference
for adaptive optics) within N arc-minutes of all sources of class X.
- Given that typical joins are slow, the user should have the
ability to estimate the number of resulting matches, or the time likely
to be taken in a query, before actually submitting it. In some
cases the indices or the EXPLAINfacility
of the DBMS can be used to estimate this.
7.3 Operations beyond the DBMS
Quite a range of other operations of interest to the astronomer,
exploring a new dataset, seem to be rather beyond the capability of the
DBMS. This is not an exhaustive list, but some examples include:
- Regression analysis - cross-correlation of one column with
another to find association rules.
- Deviation detection - finding outliers from some association rule
(simple outlier detection from the distribution of values in a single
column is obviously possible with SQL, as noted above).
- Sequence analysis - one is not allowed to assume any specific
ordering of rows in a relational database, so that most of the
operations of time-series analysis need functionality beyond that of
the RDBMS.
- Clustering and classification analysis: one might be able to find
simple associations in space by performing a spatial self-join of a
table, but more advanced clustering analysis needs specially-written
software.
- Grapical operations, such as plotting one column against another,
projecting positions into an image, and all other visualisation
operations.
For all such operations it will obviously be necessary to export the
data and ingest into another package. We have not yet had time to
examine any of the possible packages in detail.
8. Performance Enhancements
Many queries will require initially a scan of the whole of a large
table in order to select a subset of interest according to some set of
criteria: - astronomers will want to run a huge variety of different
queries often on functions of values in columns. A very simple
example might be:
SELECT * FROM twomass WHERE (hmag - kmag) > (kmag - mmag) AND abs(Galactic_latitude) > 15.0;
Unless indices have been built on expressions such as
(hmag-kmag)
or
(kmag-mmag) or perhaps on
GalacticLatitude then
such a query will be done by a sequential search. An index will
be worth-while only if the same sub-expression is likely to be used a
number of times. Building an index requires a sequential scan of
the whole table, of course.
It is not too difficult to write software to optimise the reading of
all data in a single column (or a few). My experiments last year,
reported in
DbmsEvaluations
showed that even a FITS table could be read at a speed essentially
limited by that of the disc I/O channel. More recently Patricio
Ortiz has been writing an experimental system as part of his Datoz
which can read a whole column of USNOB in about 90 seconds with one
processor, which is a rate around 45 MB/sec. If we could do that
on an 8-way parallel system, that would be reduced to about 10 secs,
perfectly acceptable for interaction, if all I want to do is to select
sources with
Bmag2 > 14. The DBMS under study performed
sequential scans substantially slower than expected from the
performance of the disc systems. These figures are not strictly
comparable, as different disc systems were involved, and the DBMS were
probably not optimally tuned for such queries, but give some idea of
the problem.
| Software Package |
Data transfer rate |
| DB2 |
8.6 MB/s |
| MySQL |
11.2 MB/s |
| Postgres |
4.6 MB/s |
| cFITSIO on row-oriented table |
7.2 MB/s |
| cFITSIO column-oriented table |
21.2 MB/s |
| Datoz (column-oriented with data compression) |
42.1 MB/s |
It is notable that a standard library accessing a FITS table can
achieve about twice that of the best DBMS, while custom code (Datoz)
can almost reach the speed of the disc channel (around 50 MB/s).
There are several ways in which we might be able to obtain better
performance:
- Tuning existing DBMS packages better.
- Distributing data over many nodes in a cluster.
- Using column-oriented storage.
The major DBMS products such as DB2, Oracle, and Sybase suggest that
the best performance can be obtained by allowing the DBMS access to the
raw disc partition, as this avoids the overhead of the operating system
and file structure. We were unable to test this with DB2, as it
would have caused too much disruption to a system used for other
purposes.
With mostly read-only tables, there could be a considerable speed
advantage if sequential scan queries were spread over a number of nodes
in a cluster. Only a few DBMS seem capable of doing this, among them
DB2. Even here there are some limitations, essentially because
these DBMS are designed for handling transactions, which need a lot
more locking and synchronisation when performed over a distributed
system. Both MySQL and Postgres have some facilities for
database replication, essentially in order to increase robustness. It
seems possible that we could split our tables arbitrarily between
separate instances of, say, Postgres, and send appropriate queries to
several nodes in parallel, and combine results. At present, though,
there do not seem to be any facilities built-in to make this easy and
considerable experimentation would be needed to pursue this further.
It is not too difficult to write software to optimise the reading of
all data in a single column (or a few). My experiments last year,
reported in
DbmsEvaluations
showed that even a FITS table could be read at a speed essentially
limited by that of the disc I/O channel. More recently Patricio
Ortiz has been writing an experimental system as part of his Datoz
which can read a whole column of USNOB in about 90 seconds with one
processor, which is a rate around 45 MB/sec. If we could do that
on an 8-way parallel system, that would be reduced to about 10 secs,
perfectly acceptable for interaction, if all I want to do is to select
sources with
Bmag2 > 14.
When looking at commercial data mining products I came across a couple
of products using column-oriented data storage. Normally a DBMS
stores data in a row contiguously, as this makes updates and deletions
faster, but it means that if you want at least one item from every row,
you have to read the entire table from disc. For essentially read-only
data, and queries which involve only a subset of the columns in a
table, column-oriented storage makes good sense: it means that the only
data blocks you have to read from disc are those holding data from the
required columns. Since many astronomical tables have 30 to 50 columns,
and one has more than 400, this can be a significant gain. The two
packages I have discovered so far are: Sybase-IQ and Kdb.
Sybase-IQ is a product of the Sybase company, but entirely separate
from their relational DBMS, though it seems to accept the same dialect
of SQL. Jim Lewis of
IoA Cambridge has had a copy for evaluation, and
found it to perform well in the type of queries which do indeed involve
just a few columns. Exact benchmarks are not available, because of
hardware differences. For some types of exploratory analysis it seems
likely that Sybase-IQ would be useful. At present there are two
obstacles: firstly it does not yet run on Linux at present (though a
port is in planned), the second is the high licencing cost, since the
list price is over £30,000 pounds per node.
Kdb comes from a company called
Kx Systemsand all I know about it is what can be seen on their web-site.
Many DBMS support the creation of a view: a virtual table created by some selection on its parent. Operations on such a view will, however, be no faster than on the parent, as (apart from some benefit from disc caching) the view does not really exist, it is just a convenience to simplify subsequent SQL.
9. Conclusions and Future Work
Some conclusions from the work so far are:
- Spatial joins based on the R-tree index are faster, simpler, and
more scalable than those based on the pixel-code algorithm.
- For spatial joins and searches, the facilities of Postgres are
easier to use and much faster than those of MySQL.
- The facilities of DB2 may be usable, but it seems to take a great
deal longer to get such a mammoth product configured and working.
The lack of any representation for nulls in external files seems a
major drawback, and in general it is much less suitable for use by
scientists than the other DBMS tested.
The plans for future work in this field are not yet decided.
Watch this space.
Author: Clive Page, with additional contributions from Patricio
Ortiz.
--
ClivePage - 26 Sep 2003
--
ClivePage - 1 Dec 2003 (fixed a few typos).