dmf
Data Mining Facility
Author: Clive Page
Version: 1.02 (with a few typos corrected)
Date: 2003 January 20
1. Introduction
In the last few years a number of large source catalogues have become available
from sky surveys and from systematic analysis of sets of individual observations.
The largest of these arise from optical and near infra-red surveys which
have generated tabular datasets of around 10
9 rows. A few examples are shown here:
Over the next
few years similar or larger datasets will become available from the Sloan
Digital Sky Survey, the UKIRT-WFCAM surveys, e-MERLIN, VISTA,
and other projects. These represent a large collection of tabular datasets
containing a wealth of scientific information just waiting to be extracted
and exploited.
Unfortunately the tools for manipulating tabular datasets, especially ones
of this size, are sadly lacking. I have attempted to list in section
2 the required functions, and it is true that nearly all of these functions
are available in one or more existing software packages (including database
management systems, DBMS) but there is no single software package which supports
more than a few of the required operations. Using two or more packages
can be very difficult as they exhibit a great diversity of user interfaces,
and they also require the datasets to be converted from one format to another.
Such conversions are time-consuming and usually loses important metadata.
I think AstroGrid should provide an integrated data mining facility to
fill this gap.
In section 2 of this document I have outlined what I feel to be the principal
operations which need to be provided, in section 3 I examine the functionality
available in current software, and in section 4 discuss some implementation
issues.
Tabular datasets arise in many other contexts in astronomy, for example
as time-series (light-curves), spectra, and indeed photon-event lists from
high-energy instruments. They are also important in research in solar
and solar-terrestrial physics. The functionality outlined here is,
I believe, sufficiently general-purpose to be useful in most of these other
areas as well.
The term
data mining has no clear
definition, except that it involves the extraction of useful information from
large volumes of data. Scientific data mining, I think, involves two
distinct phases:
- Data exploration, in which the scientist searches for patterns, or
correlations, or interesting exceptional cases by sifting through the bulk
data, using a range of fairly standardised operations.
- Bulk processing: this is likely to be a batch process perhaps composed
of a sequence of steps discovered during the exploration phase, or if this
turns out to be too slow, may involve specially written software.
It is in the data exploration phase that I think
AstroGrid software should
be able to play a leading role. One important operation is that of
data cleaning: the removal of spurious
data points, errors of observation, and the like. Experience in the
commercial world suggests that over half the effort in data mining is
typically devoted to obtaining "clean" datasets; the number of things that
can and do go wrong in astronomical observations suggests that it will be
at least as important activity in astronomical data mining. Most of
the work of cleaning is also an interactive operation, so as to discover which
points are spurious, and what characteristics they have that will enable them
to be identified and removed when doing the bulk processing.
2. Functional Requirements
This section sets out first the general system requirements, and then lists
the principal functions or operations to be supported.
2.1 General requirements
- Interactive or batch use: Data
exploration needs to be carried out interactively as far as possible. Since
some operations on large tables will be slow, there also needs to be a semi-interactive
mode, in which an operation is started interactively, but it continues in
the background, the user being notified of completion. There is also
likely to be a need for the user to be able to chain together a sequence of
operations found to be successful during the exploration phase, to run as
a batch job during the production phase. It is desirable to provide
a way of doing this by editing a script gathered from the commands executed
interactively.
-
Logging: commands need to be logged to
allow replay (as a script) and also so that the sequence of operations producing
an output dataset can be recorded in its metadata (e.g. as HISTORY records in a FITS file).
- Default dataset: since most
operations will apply just to a single table, and that table will often be
used in many sequential operations, it is desirable to have a way of explicitly
opening a table, generating a persistent connection. This simplifies
interactive use, because there is no need to name the current table in each
command, and supports the semi-interactive or batch usage, which needs a
persistent connection identifier.
- Subsettting: data exploration
often proceeds by stepwise refinement. For example the user may select
some subset of a table by choosing the range of some of its columns, and
then examine it, perhaps producing a plot or a histogram, perhaps sorting
it into ascending or descending order of some column, and then may select
a new subset. There will often be a need to back-track from sequences
of operations which reach a scientific dead-end, and undo the previous few
operations. This mode of step-wise working has been supported by a
few previous software systems (for example the BROWSE database currently
supported by HEASARC, and the now obsolete DBMS called System1032). Many
DBMS support the creation of a VIEW which is somewhat similar, but it is
not always possible to create another view from an existing view. This
way of working should be supported by the proposed data mining facility:
by default all operations are on the current
dataset which is initially the whole table, but will be some subset
of it after one or more SELECT
or similar operations have been applied.
-
Read/write access: the facility
will, of course, only allow users read-access to standard catalogues, but
certain write/update operations need to be supported on users' own tables
and on subsets of read-only tables which they generate themselves and save
for future use.
- Slow operations: some operations
will be index-assisted and thus fast, others which require the sequential
scan of a terabyte dataset will be very slow: the user needs to be notified
of any operation which will be unexpectedly slow, and allowed to run it in
the background or as a batch job, with notification of completion.
2.2 Dataset housekeeping
Open a table: this selects the whole
of an existing tabular dataset as the default for subsequent operations. The
main option is to specify whether the table is opened
readonly or whether writes and updates
are permitted (subject to the required ownership rights, of course).
Close the table: required to match
the open operation (housekeeping operation).
List available tables: this
should display the names and basic properties of the tables available locally,
both system-provided and personally generated.
Drop (or maybe
undo): this drops the currently selected
dataset or undoes the operation of the preceding processing stage (housekeeping
function).
Show the datasets in the stack
of subsets from earlier operations (housekeeping function).
Describe: this displays the metadata
of the current table where provided (FITS, VOTable, etc.). It obviously
is not possible to do this with tables in a DBMS where there are no practically
no metadata except the raw list of column names and data types.
Change the name or units of
a column (subject to access restrictions).
Import external table: formats to
be supported (if these are not the internal internal formats) should include:
FITS, VOTable, TSV, fixed-format text tables.
Export the current dataset to a
file in the specified format; those to be supported should include: FITS,
VOTable, TSV, fixed-format text, HTML, Latex tables.
Create a new table interactively:
user specifies the column names and data types, and then may enter data, as
a simple way of generating a new small tabular dataset.
2.3 Basic table operations
Browse: this allows the user
to scroll through the current dataset in row order and, if the table is too
wide to fit on the screen, allows the selection of columns. A spreadsheet-type
display is desirable, with the ability to jump to specified row postions in
the table, and also start and end of it. It is also important to give
the user control over the display formats of numbers: the user may want floating-point
numbers displayed in a fixed-point style with some sensible number of digits
and not just in exponential form (as in so many DBMS); integers representing
bit-maps should be displayable in binary or hexadecimal format; angular quantities
such as RA and DEC must be displayable in sexagesimal form (hours:minutes:seconds),
and so on.
Select: this is essentially the
SELECT command from SQL, which
allows a subset to be chosen based on the properties of individual columns.
There is a need, I think, both for a simple forms-based approach (select
range or value) but this gets difficult if individual elements need to be
combined with
AND and
OR operations, for which it is desirable
to allow a more complex expression-based selection (as in SQL, Xquery, etc.).
The new subset should become the default dataset for subsequent operations.
Obviously indexes have to be used to speed up select operations, if available.
Sort: this sorts the current dataset in
ascending (optionally descending) order, generating a new dataset of the same
length (except that the rows with null values in the sort column will vanish).
Find Extrema: in many cases the
user is only interested in seeing or selecting the top N (or bottom N) rows
in a table. where N is much smaller than the total number of rows,
there are much faster ways of finding the extrema than by sorting the whole
table, so this operation should be available separately.
Project (compute) new columns as
a function of values in existing columns (and/or metadata values). Some
standard functions need to be provided for common coordinate manipulations,
e.g. precessicang
(RA,DEC) to
another epoch, or computing galactic longitude and latitude.
Statistics: compute simple
statistics on one or more columns of the current dataset (e.g. count of non-null
values, min, max, mean, standard-deviation, median)
Regression: compute linear regression
coefficient on a pair of columns.
Group values in an integer/string
column over common values (as in the SQL
GROUP BY operation).
Histogram: produce a histogram of
values in a single column, either as another tabular dataset or graphically
on the screen. Range and bin-size should be selectable, with sensible
defaults. It is probably desirable to support 2-d or 3-d binning to
produce a data grid or cube, with appropriate visualisation.
Find first or next occurrence of
a given value (or maybe a value in a given range) in a specified column in
the current dataset: clearly an SQL-style
SELECT will find all such values,
but sometimes it is desirable to merely scroll to the part of the table where
such a value is located, rather than generate a new subset.
Delete (or mark as deleted) a specified
subset of rows (subject to access restrictions) producing a new dataset; this
allows an undelete to be provided via the drop or undo command.
Create an index for a specified
column or columns, so as to speed-up subsequent selection operations (as
in SQL
CREATE INDEX). It
may be necessary to support both B-tree and R-tree or other spatial indices.
Update values in an existing column
by reference to a formula (e.g. re-compute a hardness ratio as the ratio of
two flux columns). This is subject to write access to the table.
Equi-join: join the table with another
named table on equality of values in corresponding or named columns - this
is an equijoin as supported by SQL systems. on columns of integers or character
strings. Options shold include an outer join.
Fuzzy join - as above but for cross-matching
two catalogues based on the overlap of error-circles. This may need
one or more spatial indices to have been created on their RA/DEC coordinates.
Options should include at least a left outer join (to identifiy unmatched
sources), and a self-join (to find sources close together).
Append new rows to a table (to which
the user has write-access) by entering data interactively, or by concatenating
two existing tables with compatible structures.
Split a table into two tables according
to whether the values in some column are above or below a specified
threshold. This could be done by successive basic operations (select,
save, undo, re-select, save again) but this is awkward.
Plot values of two (or more) columns:
the facilities should include basic X-Y graphics and overlaying of points
on existing images (given suitable coordinate frame metadata). The range
of graphical output that it might be desirable to produce is rather large,
and merges eventually into the field of data visualisation. While it
is essential to integrate basic graphical facilities with the table-handling
operations, I think one has to accept that the more advanced visualisation
facilities will have to be provided by exporting the data and using a specialised
package.
Edit a table (to which the user
has write access) interactively, e.g. to correct individual invalid values.
Sample every Nth row to generate
a new dataset which is much smaller but has similar overall properties to
the parent: it may be desirable to allow the selection of a random subset.
Resample: rebin a given column with
a new bin-size using nearest-neighbour or an interpolation method.
Differ: find differences between
one or more columns of two tables.
Time-series functions: it is likely
that some additional time-series functionality will be needed, such as computing
running means, finding peaks, taking Fourier transforms, and so on, but the
exact set of operations that can be included in a general data mining facility
needs further discussion, I feel.
3. Existing Software
3.1 On-line Services
Vizier (one of the CDS facilities
with several mirrors) has the world's largest collection of tabular datasets,
mainly source catalogues. Its on-line facilities include the selection
of subsets from one or more tables, with output in FITS, TSV, or XML format.
It is possible to upload a list of constraints in a text file, or a
list of sources to be cross-matched. There is an output limit of 9999
rows.
W3browse is an on-line facility
from HEASARC (NASA/GSFC) which allows two catalogues from their pre-defined
set to be cross-matched on spatial coordinates. It is also possible
to perform an anti-correlation to find sources which do not have counterparts
in the second catalogue, and to correlate on time of observation.
SkyServer is a tool for cross-matching
the Sloan Digital Sky Survey data releases with the FIRST and 2MASS catalogues.
It is designed to work on relatively small areas of sky, and has limits
on the volume of output for users outside the project.
The on-line catalogue access services have a common limitation that they
are really only designed for accessing data from a small area of sky. Investigations
which require a scan of the whole sky for, say a given class of source, are
simply not feasible using any of these services.
3.2 Software Packages
I have attempted to analyse the table-handing functionality in what I think
are the four most widely-used astronomical software collections:
- FTOOLS is a collection of utilities for handling FITS images and tables,
part of HEASOFT from NASA/GSFC. The
fv program is controlled by a GUI,
all others are command-line tasks.
- CURSA is a package written by Clive Davenhall for handling astronomical
catalogues stored in FITS tables (and other formats); it is part of the Starlink
Software Collection. The xcatview
task is run from a GUI, the others
are command-line tasks, using the Starlink parameter system.
- STSDAS-TABLES was written at STScI as an add-on to the well-known
IRAF package from NOAO. STSDAS TABLES package is a set of command-line
programs which operate on tables stored in their own internal format, but
which can be in either row or column-major order. There are import/export
utilities, but only for text formats, and Tex/Latex tables, not FITS.
- MIDAS from ESO has many tasks for table manipulation: MIDAS has its
own internal table format (column-major) but can import and export FITS tables.
All are command-line tasks.
The following table attempts to list, for each operation listed in section
2 above, whether the function is supported by these four packages, and by
a DBMS suppporting standard SQL. Where a cell is blank it means that
as far as I can find out the corresponding function is not provided (at least
directly). In some cases there are functional overlaps or partial functionality
which are hard to display in a simple tablular form.
|
FTOOLS
|
CURSA
|
STSDAS
|
MIDAS |
DBMS with SQL
|
browse
|
fv
|
xcatview
|
|
|
|
select
|
fselect
|
catselect
|
tselect
|
select/table
|
select
|
sort
|
fsort, fv
|
catsort
|
tsort
|
sort/table
|
select + order by
|
extrema
|
|
|
|
|
select .. top n
|
project
|
|
xcatview
|
tproject
|
compute/table
|
select
|
statistics
|
fstatistic
|
xcatview
|
tstat
|
statistics/table
|
select + avg()
|
regression
|
|
|
tlinear
|
compute/regression
|
|
group
|
|
|
|
|
select + group by
|
histogram
|
fhisto
|
catgrid
|
thistogram
|
compute/histogram
|
|
find
|
fv
|
|
|
|
|
delete
|
fdelrow, fv
|
|
|
|
delete
|
create
index
|
findex
|
|
|
|
create index
|
update
|
fcalc
|
|
|
|
select
|
equi-join
|
|
|
tjoin
|
|
select + join
|
fuzzy
join
|
|
catpair
|
|
join/table
|
(select + join)
|
append
|
fmerge
|
|
|
merge/table
|
insert
|
split
|
|
catselect
|
|
|
|
plot
|
fplot
|
catchart
|
gtedit
|
plot/table
|
|
edit
|
fv
|
|
gtedit
|
edit/table
|
|
sample
|
|
catselect
|
|
|
|
resample
|
|
|
tsample
|
rebin/table
|
|
differ
|
fdiff (?)
|
|
tdiffer
|
|
|
It is important to realise that none of
these four packages is designed to handle anything more than fairly modest-sized
tables: in particular none have any general concept of indexed access. The
catpair operation of CURSA relies on the second table being sorted in ascending
order of one positional coordinate; the
JOIN/TABLE operation of MIDAS generates
an R-tree internally but does not preserve it.
Although there is a FTOOL
to create an index, none of the other tasks is capable of making use of it.
There are also many statistical and visualisation packages available, which
offer some of these functions. As far as I can find out, every single
one of them is designed to operate on datasets which can be stored in the
central memory of a computer: perhaps this is not surprising, since to a statistician
a collection of 100,000 points is a very large dataset indeed. Although
memory prices are still falling, and workstations with a gigabyte of memory
are no longer exceptional, we are still some years away from being able to
hold a table of a billion rows in memory. This means that memory-based
packages are currently unsuitable.
4. Implementation Notes
Both FTOOLS and CURSA are built using the cFITSIO library, which has network
access built in: both of these packages can therefore access FITS files on
remote machines using either FTP or HTTP protocols. The way this works,
however, is that the whole of the remote dataset is copied to local scratch
storage. The more advanced data access methods in OGSA-DAI may allow
a finer-grained access to remote data.
None of the existing software packages looks attractive as a basis for the
desired data mining facility, as they are really not designed to access the
large tables as are now becoming commonplace. To the FITS tasks one
would need to add indexing and a better GUI.
Tabular Data Infrastructure
If a DBMS were used as the basis of the data mining facility, the main problem
areas would be:
- Lack of support for metadata such as physical dimensions of columns,
and table-wide metadata.
- Lack of any graphical facilities at all, or of efficient export formats
for connecting to external packages.
- Poor support for interactive facilities. Some DBMS have the
concept of a VIEW which is a subset, but it is not clear whether this is
flexible enough for what is proposed.
- An additional software layer would be needed to provide a persistent
connection for the semi-interactive and batch facilities.
- DBMS do not appear to be very efficient at sequential data access:
our tests suggest that such operations are cpu-bound rather than I/O bound
as one would expect.
- Only a few DBMS support parallel hardware such as Beowulf clusters
(e.g. DB2, SQL Server).
If the data mining facility were to be built upon FITS binary tables, the
main problem areas would be:
- Indexing would have to be added (not difficult, prototypes exist).
- FITS tables of more than 231 bytes (2 Gbyte) are possible,
but only with a file system supporting large files, and a C compiler supporting
64-bit integers. Use of FITS tables with more than 231rows
appears to be consistent with the Standard, but is not supported by the current
cFITSIO library. The largest current catalogues are smaller than this,
but only with a factor of 2 in hand.
- There is no built-in support for parallel hardware; it would be necessary
to split up large tables into chunks distributed over the set of nodes.
User Interface
Whatever the underlying data access mechanism, a graphical user interface
would be needed. The best basis for this would probably the Java Swing
package, which has good support for spreadsheet-style tabular displays. For
the more complex operations some sort of query language is needed: this has
been the subject of a number of discussions, and something based on SQL SELECTs
or Xquery syntax is an obvious choice, but many details remain to be resolved.
Client Side or Server Side?
It is not clear whether the data mining facility could best be provided as
a stand-along client program, or as server-side software. I have deliberately
written the functional requirements in section 2 to be independent of this
decision.
Providing the data mining facility on the server which also holds large standard
catalogues would obviously make them more accessible, but would make it more
difficult for users to use the package with their own tables, as these would
have to be explicitly uploaded. The
MySpace concept may help, but I
forsee problems in persuading users to leave their valuable datasets for
long periods on some unknown server elsewhere "on the grid".
Providing a software package which runs on the user's own machine would simplify
access to the smaller datasets, and improve response time for simple interactive
operations, but this would generally be inefficient for access to large remote
catalogues.
--
ClivePage - 20 Jan 2003
The comments originally appended here from Clive Davehall, Patricio Ortiz, and Bob Mann have now been transferred to the
Astrogrid Forum site where it will be easier to continue the discussion. I hope nothing got lost in the transfer.