PhaseAReport

(7) Database Technology and Data Mining

(7.1) Introduction

In the commercial world almost all large data collections are stored within database management systems (DBMS), but in astronomy DBMS are used very little except in the management of data archives. Even within astronomical archives the bulk of the data are generally stored in external files (such as FITS files) with just the filenames stored in the database.

The aim of work package A4 was to investigate whether modern DBMS technology could be used more widely in large-scale astronomical data management, especially in view of the data avalanche from new instruments. In particular we wanted to investigate the value of DBMS for what is loosely called data mining - the attempted discovery of valuable scientific information buried in large data collections such as from sky surveys, or from sets of observations originally gathered for some other purpose.

Ideally the astronomer will make use of the virtual observatory by posing questions and getting information back, but all we can reasonably expect is to allow users to submit queries and get results. The translations between the scientific and computing domains are likely to require at least some astronomical expertise for the forseeable future. The allowed forms of queries are inevitably a compromise between what the astronomer wants and what data archives can process, and similar considerations apply to the results produced. In order to understand the requirements better, we have analysised a large number of potential scientific problems and use-cases.

(7.2) Use-cases

The AstroGrid project agreed at an early stage to be use-case driven, and this analysis of database requirements draws heavily on our prior experience of using and managing astronomical data archives, and specifically on:

These use-cases were further deconstructed into basic data management operations.

It should be noted that the following classification is set out in terms of extra-solar astronomy, in which celestial position (usually specified as a pair of spherical-polar coordinates) is the most important indexing parameter, but a similar classification can be made in solar physics (in which heliocentric coordinates play a similar role), and in STP where event time is usually the primary data locator.

In database terms, there seem to be two main classes of query:

  • Positional queries: where the user wants information about a small patch of sky around a given celestial position, or about a named celestial object (which can be converted to celestial coordinates using a name resolving service such as Simbad or NED).
  • Non-positional queries: essentially everything else. This includes queries which require a sequential scan of all (or much of) large datasets, or which involve I/O-intensive joins between two or more tables.

(7.2.1) Positional Queries

Positional queries are especially important in astronomy since many astronomers spend long periods investigating one celestial object in detail. Indeed such queries are so common that most archives have been set up to handle them optimally (or even exclusively). Positional queries are also special in database terms because they can be handled efficiently by indexing on celestial position, (although in practice two-dimensional indexing presents some interesting problems). Although in theory the user of a database should not need to know which parameters have been indexed, in practice the sequential scan of a billion-row table may take hours, while and an indexed look-up in the same table will take no more than tens of milli-seconds, so indexing cannot be ignored as a mere implementation detail.

Positional queries can be further subdivided based on the main types of information stored in astronomical data archives:

  • Queries of source catalogues, for example:
    • Are there any radio sources near to HD123456?
    • Has anyone measured the B magnitude of 3C678 between 1990 and 2000?
    • Can I have a list of all known objects within 2 arc-minutes of PSR1234-567?
  • Queries of image repositories, for example:
    • Can I see an IR image of M87?
    • Is there an X-ray image of the sky around (12:34:56,-45:59)?
  • Queries of specific observatory archives, for example:
    • Show me the results from the IUE observation of Beta Hydri?
    • Can I download the raw data from the XMM-Newton observation of NGC9876 to re-analyse it myself?
  • Bibliographical queries. No work has been done on this area, because existing services such as Simbad and NED already do an excellent job.

Source catalogues are essentially tabular datasets listing the positions, fluxes, and other properties of all the astronomical objects detected in some area of sky in some waveband. The larger ones currently have around a thousand million rows and a few have over 100 columns. Some arise from the systematic survey of large areas of the sky; others are produced by analysis of the sources detected serendipitously in images of the sky obtained in other programmes.

Images: Collections of images of the sky are another important astronomical resource. The total volume of data in an image archive may be large: for example sampling at 1 arc-second resolution with 4 bytes/pixel one gets 2 terabytes for the whole sky, but the individual images are usually only of modest size, and the indexing requirements are also easy to meet. Although most DBMS can store images as binary large objects (BLOBs), in practice almost all archives have chosen to store images as external files (mostly as FITS images), to simplify interworking with other data analysis software.

Observatory archives are repositories of raw data or semi-reduced data. They are the province of the more expert users, seeking to download data for further reduction and analysis on their own computers. The access is usually via the observation log, a table of modest size usually indexed by celestial position. These logs are usually fairly small (under a million rows), so indexing is simple. The main complication is that the telescope field-of-view is often a complex shape, so determining whether a given celestial position was in the field-of-view or not may require fairly complex calculations.

To summarise: positional queries are generally handled fairly well by current archives, although many of them still depend on inefficient methods of sky-indexing (more on this below). But they do present a problem as far as data mining is concerned because of the variety of conventions they use for submitting queries and the variety of data formats for the data that they return. The standards being devised (described further in the interoperability chapter) will, we hope, result in these archives being accessible as Web Services, and so easy to integrate into an overall data mining infrastructure.

(7.2.2) Non-positional queries

Non-positional queries are those typically involving the study of a class or group of astronomical objects. Astronomers carry out such a wide variety of operations on the data retrieved from archives that it is virtually impossible to provide an exhaustive classification, but the following types of operation appear to be among the most important:

  • Cross-identification of sources from two (or more) source catalogues on the basis of positonal match and perhaps other criteria. (Note: a join is, of course, just a series of positional queries, but since an outer-join requires a complete scan of the first table, it really belongs in this category)
  • Selection from a source catalogue by reference to the properties of the sources, for example finding all stars in a stellar catalogue with a particular spectral type and range of proper motions.
  • Statistical estimations and searches: for example finding all sources with a colour index which is more than 3-sigma above the mean for that table.
  • Data mining queries of more advanced types, these include using:
    • Classification and clustering algorithms to find groups of objects with similar properties;
    • Regression analysis: finding combinations of properties which are significantly correlated;
    • Sequence analysis: carrying out time-series analyses to find periodicities, bursts, or other temporal anomalies;
    • Similarity or dissimilarity searches, for example analysing sets of images or source lists from different epochs to find objects which have moved or changed in strength.
    • Measuring properties of the sky on the large scale, using for example power-spectral densities.

(7.2.3) Cross-identifications

The cross-identification of sources from different source catalogues is an important basic operation often the precursor to more advanced data mining investigations. The operation occurs in most of the "top ten" AstroGrid science cases. This requires, at its simplest, what is often termed a fuzzy-join, since the source coordinates in each catalogue will have associated errors, so an approximate match within limits of error replaces an exact match, and one needs to compute the great-circle distance between points expressed in spherical-polar coordinates (RA,DEC).

Many existing data archives just use an index on one spatial coordinate (sometimes RA, sometimes declination) which does not produce an efficient or scalable solution. What is really required is a spatial index, for example one based on the R-tree, but but few DBMS have spatial indexing built-in, and the few which have them do not cope well with the singularities of the spherical polar system.

A new solution to this problem was discovered as part of the AstroGrid Data Mining investigations, based on covering the sky with a grid of approximately equal-area pixels, and using a one-dimensional index on these pixel values. The PCODE algorithm solves the problem cases in which error-circles overlap two or more pixels by inserting additional rows in the tables, and using the SQL DISTINCT function to remove duplicates from the resulting table of matches. The method can be used with any suitable pixelation of the sky, for example the Hierarchical Triangular Mesh (HTM) or Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) algorithms, but in our tests only the latter was used. This PCODE method has the significant merit of only requiring an equi-join on integer fields, which occurs so often in commerce it is well optimised by all DBMS, and only needs simple B-tree indices on the PCODE columns. It can also be used with an outer join to determine sources in one table unmatched with those in another, which is an operation often scientifically useful.

The use of a uniform sky pixelation, such as HTM or HEALPix, will also permit existing data achives to implement a much more efficient sky indexing method; at present many of them just utilise a B-tree on one of the two spatial coordinates, a solution which will get more inefficient as datasets get larger.

It is important to note that source identication often depends on more than just positional coincidence (see this note on association methods) but the other operations are relatively fast once the fuzzy-join has been carried out.

One of the DBMS in our evalution, Postgres, has R-tree indexing built in. We therefore used it to compare the traditional R-tree solution with the PCODE algorithm using identical hardware and software. This showed the PCODE join to be about twice as fast, and while using substantially less disc space. Index creation was speeded up even more. Since the PCODE algorithm can be used with any DBMS, not just the few which support spatial indexing, this significantly improves our freedom of choice.

(7.2.4) Selection Operations

Simple SELECT operations are easy to do in all relational DBMS, but the speed with which they can be carried out depends entirely on whether a suitable index exists. Since many astronomical tables contain a large number of columns, it is not always easy to arrange this. For example the source catalogues produced from the Hubble Deep Field contain 123 columns. Even worse, astronomers often want to select on some expression involving multiple columns, such as a magnitude difference, or a flux ratio. In a wide table there could be thousands of possible simple combinations, and it is impractical to maintain indices on them all.

(7.2.5) Statistical Operations

Most DBMS support simple statistical functions such as mean and variance, but SQL is not an ideal language for statistical work. Ideally the entire table needs to be scanned to measure its statistical properties, but in practice a carefully-selected sampling scheme may speed up some operations. Nevertheless statistical operations are inherently slow, being mostly unable to benefit from the presence of indices.

(7.2.6) More Advanced Operations

Many of the more advanced operations will require the use of specialised software packages. If the data are stored within a DBMS this means either exporting them in some format (e.g. as a plain ASCII file, or preferably in something XML-based such as VOTable), or accessing the DBMS through some API such as ODBC, JDBC, JDO. We hope that current projects such as OGSA-DAI and Spitfire (from CERN) will produce even more powerful grid-enabled interfaces. But whichever option is chosen, many types of data mining operation will require sequential reading of substantial parts of large tables.

(7.3) Data Exploration and Data Mining

Just as in the real world of exploiting mineral resources, before any nuggets of astronomical value can be extracted from the mass of data, the landscape has to be explored thoroughly: indeed in science the term data mining is usually shorthand for both the exploration phase and the phase in which bulk data are sifted.

Ideally a data exploration and mining facility would give the astronomer the ability to access any dataset on the web without needing to copy it explicitly to a local machine. It is interesting to note that software based on the well-known FITSIO library, such as the FTOOLS collection from HEASARC and the CATPAC package from Starlink have supported transparent remote access for some years, as both FTP and HTTP protocols are built in to the lower layers of the library, while DBMS packages normally require all tables to be local, and they have to be explicitly imported to a given database. A few DBMS have very limited support for access to foreign files, but usually only to tables stored in their own format, or sometimes in the databases of one of the rival vendors, and only when the relevant clients have been installed.

Many of the functions needed in data exploration and mining are those that any DBMS supports, such as selecting data, sorting it, grouping, finding means and extrema, joining with other tables, projecting new columns, etc. Other essential operations can also be done in SQL-based systems but only with rather more difficulty, for example finding the median and other quantiles, finding outliers, computing trend-lines and regressions, computing statistics, etc. Specialised statistical packages support many of these operations rather better, but all of them seem to be based on memory-resident datasets, which seriously limits their use on large astronomical datasets. Astronomers will also need graphical output and simple visualisations of their datasets, e.g. 1-d and 2-d graphs, histograms, contour maps, etc. and ways of overlaying source positions on images. These are also well beyond the facilities of any current DBMS. There are many visualisation packages which have most of the right functionality, but they all have their own data formats, so data conversion problems abound, and metadata loss is usually inevitable on format conversion. These problems have no easy solution. The problem of preservation and propagation of metadata is especially acute when using a relational DBMS, and none of them seem to preserve any attributes of a column or a table, beyond the most basic (name, and data type).

Our conclusion is that if a DBMS forms the core of a system designed to support astronomical data mining, it will be just one of a number of software packages installed, because of breadth of functionality required by astronomers is so large.

It is perhaps worth noting the international dimension: although the UK is producing many datasets of world class which will be important for astronomical research over the next few years, many other important datasets are resident overseas. Almost every one of the important science cases depends on the use of one or more foreign data collections.

It has also become obvious that many operations will require access to remote datasets which the remote systems are not prepared to support. Obstacles and limitations include:

  • The necessary software will not always be installed on the remote system.
  • Some operations require scans of all or substantial parts of a large table, and some algorithms are not only I/O-intensive but also cpu-intensive.
  • Astronomical institutions are noted for putting a high proportion of their data on open access, but this does not necessarily mean that they also provide facilities for compute-intensive use. The majority of existing data archives restrict the amount of work that can be done, either by limiting the operations to those which are utilise the available indices, or by putting restrictions on cpu time or the volume of downloaded data. Such limits are entirely reasonable, since these facilities are currently open to the world, and unlimited access could easily be misused. It is to be hoped that the general adoption of an improved security infrastructure, based on the authentication and authorisation mechanisms currently being explored in the e-science community will allow these restrictions to be relaxed gradually. But for the foreseeable future it must be accepted that astronomers will often want to perform operations on remote datasets which the remote systems themselves cannot support.
  • Some operations will require a network bandwidth which is not yet available to the remote site, and network bandwidth is growing more slowly than either available processing power or disc space.
  • Some operations will be feasible only on links with low latency. At the recent "e-Science All Hands Meeting" in Sheffield the panellist from Microsoft Research, Dr Andrew Herbert, cited latency as the major obstacle to grid-based data-intensive operations, and no amount of technology will defeat the speed of light.

(7.3.1) The Astronomical Data Warehouse

In order to overcome these obstacles, we think that the UK astronomical community will want to set up a small number of data centres which we propose to call "data warehouses". The term is borrowed from the commercial world, where many large organisations have found it necessary to set up specialised facilities for analytic data processing and data mining, containing essentially static copies of datasets from elsewhere in their organisations.

The astronomical data warehouse has many elements on common, but differs in detail. We envisage a data warehouse as containing:

  • A powerful computer system, for example a Beowulf cluster;
  • Good network connections;
  • Local copies of the more popular astronomical datasets;
  • Ample temporary disc space for additional datasets to be downloaded on demand.
  • At least one powerful DBMS;
  • A fairly comprehensive set of standard astronomical data processing packages.
  • Grid software components at the very least for authentication and authorisation, and for efficient data transfer.

Clearly an Astronomical Data Warehouse will not be able to get copies of external datasets except with the permission and indeed cooperation of those responsible for them, but based on past experience, we expect no difficulties here. Indeed most sites seem fully prepared to make copies available to other sites who will in turn make them available to the astronomical community.

Although the term "warehouse" may be new, the concept is not much more than an obvious extrapolation of a trend which has been apparent for some years. Many of the most important astronomical data collections have at least one mirror site somewhere else in the world, and some of the most popular ones, such as Vizier, exist in more than half-a-dozen locations. Powerful data mining centres will be needed anyway for new archives such as those from e-MERLIN, XMM-Newton, WFCAM, and VISTA, and since all these datasets will be used in conjunction with those from other sites, the warehouse concept will follow fairly naturally. We propose to set up a dedicated AstroGrid data warehouse only to develop and prove the concept: we expect that the major data centres in the UK will find it worthwhile to provide data warehouse facilities, so that several will soon be set up within the UK. It will obviously than make sense to connect these separate warehouses to using data grid techniques, and in time it might become a seamless data mining facility for the UK community.

(7.3.2) MySpace

The MySpace concept is that authorised VO users should have a semi-permanent storage area allocated to them on the virtual observatory - no doubt some of this will physically reside on various astronomical data warehouses. The facilities will allow the user to publish fully reduced data and make it available to the rest of the world.

Neither of these concepts is yet fully defined, and their development and implementation will be undertaken during Phase B.

(7.4) DBMS Evaluations

Our DBMS evaluations were undertaken, not in the expectation that it would be possible to identify a single outstanding product, but in the hope that we could eliminate at least a few possibilities on the grounds of serious incompatibility with our general requirements.

(7.4.1) Object-oriented DBMS

The properties of the OO-DBMS look well-matched to the requirements of astronomical data centres in many ways, as they allow a schema to be exactly matched to the complex data structure, as often arises from astronomical observations. They also offer the prospect of greater efficiency, as many operations requiring a cpu-intensive join can be performed merely by following pointers. Unfortunately no OO-DBMS seems to have gained significant market share. Our experience is also that that few standards exist for interfaces and queries, and support for them is poor. We have links with two astronomical projects which have tried OO-DBMS with unsatisfactory results:

  • The XMM-Newton Survey Science Centres (at Leicester, MSSL, Strasbourg) adopted O2 for both the data pipeline management and the subsequent archive; after successive take-overs of O2 by Unidata, Ardent Software, Informix, and finally IBM, the product is essentially defunct, and maintenance has become extremely expensive.
  • The SLOAN Digital Sky Survey data centres adopted Objectivity/DB for their data pipelines and archive access, but after experiencing poor performance and unsatisfactory software support, they have switched to using SQL Server instead.

As a result we have only taken an interest in relational DBMS (although most of them now claim to be object-relational, a term which has no very precise definition).

(7.4.2) Relational DBMS

We chose five DBMS for examination, because of their existing usage in astronomy or their known characteristics:

  • DB2 (from IBM) - heavyweight commercial product with reputation for good standards-conformance.
  • MySQL - open source product with a reputation for speed; used by many existing astronomical data archive sites; now supports transactions but still has less complete coverage of SQL than the others.
  • Oracle - leading heavyweight commercial product.
  • Postgres - open source product with multi-dimensional indexing using R-trees.
  • SQL Server (from Microsoft) - full featured and easy to use, but only available for Windows platform.

(7.4.3) Results

Detailed results from our evaluations to date may be seen on the wiki but the principal finding are summarised below.

  • No complete show-stoppers were found, although a number of unexpected obstacles to easy use by astronomers were uncovered.
  • None of the DBMS provided a good way of importing bulk data from, or exporting to, binary files - the use of text files wastes bandwidth, cpu time, and disc space, and risks losing precision and metadata.
  • All DBMS supported indexed operations very efficiently, but were slower than expected on sequential scans of large datasets (more on this below).
  • All had similar performance, although because of different platforms used we were only able to compare MySQL and Postgres directly; here MySQL was often twice as fast as Postgres, and much faster in a few special cases where results were cached.
  • None of these DBMS implemented full standard SQL92, despite having now had 10 years to do so. Almost every one of our test queries had to be modified when moving from one DBMS to another. This underlines the need for a single astronomical query language (AQL) which is translated into the dialect of SQL handled by each DBMS. A recent Sourceforge project called Liberty Database Connectivity (LDBC) has much the same idea, with a translator from a standard query language to the appropriate dialect of SQL, but at present this package does not support the trigonometric functions that an AQL will need.
  • It was even more annoying that we needed to modify our "plain ASCII" data files each time, since there was no single representation of null values compatible with all these DBMS.
We have not yet had time to evaluate any of these products on parallel hardware, but have read the relevant documentation.

Sequential scans of large tables or substantial parts of them are bound to occur in any data mining facility, for a number of reasons already listed, although one would obviously try to avoid them whenever possible by creating of suitable indices.

Even if sequential scans only form a small proportion of all database accesses, the fact that they tend to take thousands (if not millions) of times longer, means that they may dominate the overall time for a typical sequence of data mining operations. Hence we felt it necessary to evaluate the effective speed (or bandwidth) of these DBMS when doing simple operations involving sequential access to a column of a large table. The results of computing the mean and standard deviation about the mean for a single column in a medium-sized table were surprisingly slow, when expressed in bandwidth terms. (Note: the figures for Postgres are as installed "out of the box" - some scope remains for tuning).

Software package Data Bandwidth
(Mbyte/s)
Postgres DBMS
0.3
MySQL DBMS
1.9
FITSIO row-orientated
7.0
FITSIO column-orientated
21.2

The last two rows report results using the same table (a 3.5 million row sample of the USNO-A2 catalog) converted to a FITS binary table, with a simple custom application to compute the simple statistics (there is an FTOOL which does this, but it computes more detailed statistics so would not have been exactly comparable). The last line reports an experiment in which the FITS file was set up with columns adjacent, rather than rows (as suggested by Prof Peter Buneman of Edinburgh).  This was indeed even more efficient, and gets fairly close to the raw speed of the disc drive.  We do not have exactly comparable figures for the commercial DBMS we tested, but they appeared to lie between the figures for Postgres and MySQL.

(7.4.4) Conclusions

A DBMS is needed to power the data warehouse, and to manage data within the MySpace domain.

Our basic evaluations are now complete, and may be inspected on the evaluation Wiki pages. As expected, there were no complete show-stoppers, the decision had to be made on the grounds of best overall suitability. Neither MySQL nor PostgreSQL had any built-in support for Beowulf clusters or farms of machines. SQL Server had the limitation that it could only be used on machines running the Microsoft Windows operating system, rarely found in clusters in the academic world. The choice beweeen DB2 and Oracle was not easy as both are large well-established packages. In the end we felt that DB2 was easier to install and manage, was more suitable for use on Beowulf-style (shared-nothing) clusters, and was essentially free for academic users through the IBM Scholars programme. DB2 v8.1 has been installed on a Beowulf cluster at Leicester and evaluations in this environment are continuing.

The fact that we have chosen DB2 for our internal use, does not affect the need for the VO to be able to access data archives no matter what DBMS they use.

ClivePage 2002 Dec 11.

Topic revision: r15 - 2002-12-19 - 17:20:22 - AndyLawrence
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback