DBMS Evaluation Criteria

Clive Davenhall (acd@roe.ac.uk)
Version 1, 22 March 2002

Introduction

WP-A4 is planning to evaluate several DBMSs to determine whether they are suitable to use to store the astronomical catalogues and archives that will be accessible to AstroGrid and ultimately the VO. This document presents a first attempt at specifying the criteria to be used to evaluate each DBMS. The primary use of the DBMSs will be to store large, tabular datasets and the criteria reflect this emphasis. Relatively minor importance is attached to storing bulk data (images, spectra etc). A report should be written for each DBMS evaluated. The criteria divide into two categories:

Query Language

In a modern DBMS interaction with the stored tables is usually via a `query language', typically some variant of SQL. Often this query language will be the only mechanism for interacting with the database tables. Typically it will be used to: Because the query language is central to the use of a DBMS it features prominently in the following criteria.

Reports on the Candidate DBMS

Short reports about the candidate DBMS should be compiled under the following headings (which, for convenience, are further divided according to broad subject area).
  1. Functionality
    1. Query Language. It is almost a given that the query language will support arbitrary projections, selections and joins using quasi-algebraic expressions.
      • What query language is provided? If it is SQL, which version or level? The minimum level of SQL required is SQL-92. Any areas of SQL-92 which the DBMS does not support should be noted.
      • Are LEFT or FULL OUTER joins provided?
      • What mathematical functions are provided by the language (sin, cos, tan, log etc)?
      • Are any statistical functions provided (and if so, which)?
      • Can additional mathematical (or astronomical) functions be defined (for example. the great circle distance)?
    2. Importing and exporting tables. What external formats can be read and written, for example: HTML, XML and plain text. Can the import and export facilities be jury-rigged into reading and writing binary FITS tables directly, without writing conversion programs?
    3. Data types.
      • What data types are provided? At least the following data types are required: double precision real, single precision real, integer (four-byte), boolean and character string. Support for various lengths of short integer is also desirable.
      • Is the standard IEEE representation used for doubles and reals? If not, the DBMS should represent doubles to at least fifteen decimal digits.
      • Is there support for any of: dates and times, BLOBS, arrays and structures?
      • Is there any restriction on the maximum length of strings and is a VARCHAR (variable length string) mechanism provided?
    4. Null handling. Is there comprehensive support for null handling?
    5. Scalability. Are there any limits on: the number of rows, the number of columns, the size of a table or the number of tables? Can the DBMS handle files larger than the 2Gb limit on 32-bit operating systems (and if so, how)?
    6. Indexing What native indexing facilities available? In particular are any 2D indexing schemes provided? Can external indexing schemes, such as HTM or Healpix, be used?
    7. Metadata. What facilities does the DBMS provide for storing catalogue metadata? Include an explicit assessment of how the metadata for a typical astronomical catalogue would be represented using the DBMS. These metadata should be taken to include: descriptive header text, keywords and details for each column (units, comments etc).
    8. Interfaces and APIs. What external interfaces are provided (for example, JDBC or ODBC)? What APIs are provided and in which languages?
  2. Inter-operability
    1. Remote access. What features does the DBMS have to facilitate its remote invocation by Web or GRID processes. In particular, does it have any features which would prevent it being used in this way (which would be something of a show-stopper)?
    2. Distributed databases. Does the DBMS have any support for distributed databases?
    3. Parallelism. Is there any explicit support for parallelism?
  3. DBMS Management
    1. Security management. What facilities does the DBMS provide for indicating and controlling the ownership of individual tables.
    2. System installation. How much effort is required to install the DBMS?
    3. System management. Estimate the impact of the DBMS on the system management effort required on the host machine and also estimate the `database administration' effort required to run the system.
    4. System resources. Estimate the system resources required, in particular:
      • the size of the installed system,
      • the disk space required for a large catalogue (compared to the size of the same catalogue as a binary file),
      • the requirements for disk space, privileges and quotas for a simple user.
    5. Mapping to host file system. How are tables, databases and indices in the DBMS mapped to files in the host operating system? Can standard backup utilities be used or are proprietary utilities provided with the DBMS required?
  4. Market position
    1. Status of the supplier. What is the status of the supplier? What likely future developments are envisaged for the DBMS? Does the DBMS have significant usage in most of the countries likely to make a significant contribution to the VO?
    2. Cost. Including any academic discounts which may be available. The costs noted should include both any one-off purchase cost and any recurrent costs for support, maintenance or upgrades.
    3. Host platforms. On which platforms is the DBMS available?

Tests on the Candidate DBMS

The tests listed in this section are partly tests of functionality (it is possible that the DBMS might not be capable of running the tests at all), but are mostly tests of performance. That is, they are timing tests. Since the tests will necessarily be run on different computers at different institutions they cannot be more than indicative of relative performance.

I will create a standard catalogue on which to perform the tests. I would have thought that it would be adequate to run the tests on a large-ish catalogue of, say, 10**6 rows, rather than the enormous surveys of 10**8 rows that AstroGrid will ultimately be concerned with, at least for the initial set of evaluations. Similarly a row size of about twenty columns, with a mixture of data types, should be adequate. I'll create such a catalogue from one of the SuperCOSMOS surveys and make it available from my ftp area.

Importing
Import the catalogue into the DBMS from a text file.
Exporting
Write a copy of the catalogue to a text file.
Projection
Project a new column from the catalogue. The new column, D, should be computed from the existing columns according to the formula:
D = (A + log B) * C
where A, B and C are existing columns in the catalogue. Depending on the facilities of the DBMS the new column may either be added to the original catalogue or written as a new catalogue.
Selection
Select a subset from the catalogue according to the criteria:
A > B + log C
where A, B and C are again existing columns in the catalogue. This operation should be a simple, brute-force troll through the catalogue selecting on un-indexed columns. I'm unsure whether tests on indexed columns are also needed. The amount of work required could escalate rapidly, particularly if we tried to implement one of the external 2D indexing schemes to expedite cone searches.
Index creation
Create an index on (a) a real column and (b) a character column.
Join
Perform a simple equijoin and also, if possible, a great circle join based on pairing the celestial coordinates. I will provide example catalogues to use in the joins.
Data mining
Compute the variance on a column (as a simple example of the sort of operation required in data mining operations).