DBMS Evaluation Criteria
Clive Davenhall (acd@roe.ac.uk)
Version 1, 22 March 2002
WP-A4 is planning to evaluate several DBMSs to determine whether they
are suitable to use to store the astronomical catalogues and archives
that will be accessible to AstroGrid and ultimately the VO. This
document presents a first attempt at specifying the criteria to be used
to evaluate each DBMS. The primary use of the DBMSs will be to store
large, tabular datasets and the criteria reflect this emphasis.
Relatively minor importance is attached to storing bulk data (images,
spectra etc). A report should be written for each DBMS evaluated. The
criteria divide into two categories:
- a set of topics which should be addressed in the report,
- a set of tests which should be performed on the DBMS. The results
of these tests should also be included in the report.
In a modern DBMS interaction with the stored tables is usually via a
`query language', typically some variant of SQL. Often this query
language will be the only mechanism for interacting with the database
tables. Typically it will be used to:
- project (create) new columns,
- perform selections,
- execute joins of two (or more) tables,
- generate reports.
Because the query language is central to the use of a DBMS it features
prominently in the following criteria.
Short reports about the candidate DBMS should be compiled under the following
headings (which, for convenience, are further divided according to broad
subject area).
- Functionality
- Query Language. It is almost a given that the query
language will support arbitrary projections, selections and joins using
quasi-algebraic expressions.
- What query language is provided? If it is SQL, which version or
level? The minimum level of SQL required is SQL-92. Any areas of
SQL-92 which the DBMS does not support should be noted.
- Are LEFT or FULL OUTER joins provided?
- What mathematical functions are provided by the language (sin,
cos, tan, log etc)?
- Are any statistical functions provided (and if so, which)?
- Can additional mathematical (or astronomical) functions be
defined (for example. the great circle distance)?
- Importing and exporting tables. What external formats can
be read and written, for example: HTML, XML and plain text. Can the
import and export facilities be jury-rigged into reading and writing
binary FITS tables directly, without writing conversion programs?
- Data types.
- What data types are provided? At least the following data types
are required: double precision real, single precision real, integer
(four-byte), boolean and character string. Support for various
lengths of short integer is also desirable.
- Is the standard IEEE representation used for doubles and reals?
If not, the DBMS should represent doubles to at least fifteen decimal
digits.
- Is there support for any of: dates and times, BLOBS, arrays
and structures?
- Is there any restriction on the maximum length of strings and
is a VARCHAR (variable length string) mechanism provided?
- Null handling. Is there comprehensive support for null
handling?
- Scalability. Are there any limits on: the number of rows,
the number of columns, the size of a table or the number of tables?
Can the DBMS handle files larger than the 2Gb limit on 32-bit operating
systems (and if so, how)?
- Indexing What native indexing facilities available? In
particular are any 2D indexing schemes provided?
Can external indexing schemes, such as HTM or Healpix, be used?
- Metadata. What facilities does the DBMS provide for storing
catalogue metadata? Include an explicit assessment of how the metadata
for a typical astronomical catalogue would be represented using the
DBMS. These metadata should be taken to include: descriptive header
text, keywords and details for each column (units, comments etc).
- Interfaces and APIs. What external interfaces are provided
(for example, JDBC or ODBC)? What APIs are provided and in which
languages?
- Inter-operability
- Remote access. What features does the DBMS have to
facilitate its remote invocation by Web or GRID processes. In
particular, does it have any features which would prevent it being
used in this way (which would be something of a show-stopper)?
- Distributed databases. Does the DBMS have any support
for distributed databases?
- Parallelism. Is there any explicit support for parallelism?
- DBMS Management
- Security management. What facilities does the DBMS provide
for indicating and controlling the ownership of individual tables.
- System installation. How much effort is required to install
the DBMS?
- System management. Estimate the impact of the DBMS on the
system management effort required on the host machine and also estimate
the `database administration' effort required to run the system.
- System resources. Estimate the system resources required,
in particular:
- the size of the installed system,
- the disk space required for a large catalogue (compared to
the size of the same catalogue as a binary file),
- the requirements for disk space, privileges and quotas for a
simple user.
- Mapping to host file system. How are tables, databases
and indices in the DBMS mapped to files in the host operating system?
Can standard backup utilities be used or are proprietary utilities
provided with the DBMS required?
- Market position
- Status of the supplier. What is the status of the supplier?
What likely future developments are envisaged for the DBMS? Does the
DBMS have significant usage in most of the countries likely to
make a significant contribution to the VO?
- Cost. Including any academic discounts which may be
available. The costs noted should include both any one-off purchase
cost and any recurrent costs for support, maintenance or upgrades.
- Host platforms. On which platforms is the DBMS available?
The tests listed in this section are partly tests of functionality (it is
possible that the DBMS might not be capable of running the tests at all),
but are mostly tests of performance. That is, they are timing tests.
Since the tests will necessarily be run on different computers at
different institutions they cannot be more than indicative of relative
performance.
I will create a standard catalogue on which to perform the tests. I would
have thought that it would be adequate to run the tests on a large-ish
catalogue of, say, 10**6 rows, rather than the enormous surveys of 10**8
rows that AstroGrid will ultimately be concerned with, at least for the
initial set of evaluations. Similarly a row size of about twenty columns,
with a mixture of data types, should be adequate. I'll create such a
catalogue from one of the SuperCOSMOS surveys and make it available from
my ftp area.
- Importing
- Import the catalogue into the DBMS from a text file.
- Exporting
- Write a copy of the catalogue to a text file.
- Projection
- Project a new column from the catalogue. The new column, D,
should be computed from the existing columns according to the formula:
D = (A + log B) * C
where A, B and C are existing columns in the catalogue. Depending on
the facilities of the DBMS the new column may either be added to the
original catalogue or written as a new catalogue.
- Selection
- Select a subset from the catalogue according to the criteria:
A > B + log C
where A, B and C are again existing columns in the catalogue. This
operation should be a simple, brute-force troll through the catalogue
selecting on un-indexed columns. I'm unsure whether tests on indexed
columns are also needed. The amount of work required could escalate
rapidly, particularly if we tried to implement one of the external 2D
indexing schemes to expedite cone searches.
- Index creation
- Create an index on (a) a real column and (b) a character column.
- Join
- Perform a simple equijoin and also, if possible, a great circle
join based on pairing the celestial coordinates. I will provide
example catalogues to use in the joins.
- Data mining
- Compute the variance on a column (as a simple example of the sort
of operation required in data mining operations).