dmf

Data Mining Facility

Author:    Clive Page

Version:    1.02 (with a few typos corrected)
Date:    2003 January 20

1. Introduction

In the last few years a number of large source catalogues have become available from sky surveys and from systematic analysis of sets of individual observations. The largest of these arise from optical and near infra-red surveys which have generated tabular datasets of around 109 rows. A few examples are shown here:

Catalogue Millions of rows
USNO-B 1046
SuperCOSMOS >600
APM 500
2MASS 162
DENIS 17
SLOAN (EDR) 10
Tycho 1
FIRST 0.7
Hipparcos 0.1
ROSAT 0.1
XMM-Newton 0.1
near future  
WFCAM ~350

Over the next few years similar or larger datasets will become available from the Sloan Digital Sky Survey, the  UKIRT-WFCAM surveys,  e-MERLIN, VISTA, and other projects.  These represent a large collection of tabular datasets containing a wealth of scientific information just waiting to be extracted and exploited.


Unfortunately the tools for manipulating tabular datasets, especially ones of this size, are sadly lacking.  I have attempted to list in section 2 the required functions, and it is true that nearly all of these functions are available in one or more existing software packages (including database management systems, DBMS) but there is no single software package which supports more than a few of the required operations.  Using two or more packages can be very difficult as they exhibit a great diversity of user interfaces, and they also require the datasets to be converted from one format to another.  Such conversions are time-consuming and usually loses important metadata.  

I think AstroGrid should provide an integrated data mining facility to fill this gap.
 

In section 2 of this document I have outlined what I feel to be the principal operations which need to be provided, in section 3 I examine the functionality available in current software, and in section 4 discuss some implementation issues.  

Tabular datasets arise in many other contexts in astronomy, for example as time-series (light-curves), spectra, and indeed photon-event lists from high-energy instruments.  They are also important in research in solar and solar-terrestrial physics.  The functionality outlined here is, I believe, sufficiently general-purpose to be useful in most of these other areas as well.

The term data mining has no clear definition, except that it involves the extraction of useful information from large volumes of data.  Scientific data mining, I think, involves two distinct phases:

  1. Data exploration, in which the scientist searches for patterns, or correlations, or interesting exceptional cases by sifting through the bulk data, using a range of fairly standardised operations.
  2. Bulk processing: this is likely to be a batch process perhaps composed of a sequence of steps discovered during the exploration phase, or if this turns out to be too slow, may involve specially written software.
It is in the data exploration phase that I think AstroGrid software should be able to play a leading role.  One important operation is that of data cleaning: the removal of spurious data points, errors of observation, and the like.  Experience in the commercial world suggests that over half the effort in data mining  is typically devoted to obtaining "clean" datasets; the number of things that can and do go wrong in astronomical observations suggests that it will be at least as important activity in astronomical data mining.   Most of the work of cleaning is also an interactive operation, so as to discover which points are spurious, and what characteristics they have that will enable them to be identified and removed when doing the bulk processing.


2. Functional Requirements

This section sets out first the general system requirements, and then lists the principal functions or operations to be supported.

2.1 General requirements

  1. Interactive or batch use: Data exploration needs to be carried out interactively as far as possible.  Since some operations on large tables will be slow, there also needs to be a semi-interactive mode, in which an operation is started interactively, but it continues in the background, the user being notified of completion.  There is also likely to be a need for the user to be able to chain together a sequence of operations found to be successful during the exploration phase, to run as a batch job during the production phase.  It is desirable to provide a way of doing this by editing a script gathered from the commands executed interactively.
  2. Logging: commands need to be logged to allow replay (as a script) and also so that the sequence of operations producing an output dataset can be recorded in its metadata (e.g. as HISTORY records in a FITS file).
  3. Default dataset: since most operations will apply just to a single table, and that table will often be used in many sequential operations, it is desirable to have a way of explicitly opening a table, generating a persistent connection.  This simplifies interactive use, because there is no need to name the current table in each command, and supports the semi-interactive or batch usage, which needs a persistent connection identifier.
  4. Subsettting: data exploration often proceeds by stepwise refinement.  For example the user may select some subset of a table by choosing the range of some of its columns, and then examine it, perhaps producing a plot or a histogram, perhaps sorting it into ascending or descending order of some column, and then may select a new subset.  There will often be a need to back-track from sequences of operations which reach a scientific dead-end, and undo the previous few operations.  This mode of step-wise working has been supported by a few previous software systems (for example the BROWSE database currently supported by HEASARC, and the now obsolete DBMS called System1032).  Many DBMS support the creation of a VIEW which is somewhat similar, but it is not always possible to create another view from an existing view.  This way of working should be supported by the proposed data mining facility: by default all operations are on the current dataset which is initially the whole table, but will be some subset of it after one or more SELECT

    or similar operations have been applied.

  5. Read/write access:  the facility will, of course, only allow users read-access to standard catalogues, but certain write/update operations need to be supported on users' own tables and on subsets of read-only tables which they generate themselves and save for future use.
  6. Slow operations: some operations will be index-assisted and thus fast, others which require the sequential scan of a terabyte dataset will be very slow: the user needs to be notified of any operation which will be unexpectedly slow, and allowed to run it in the background or as a batch job, with notification of completion.

2.2 Dataset housekeeping

Open a table: this selects the whole of an existing tabular dataset as the default for subsequent operations.  The main option is to specify whether the table is opened readonly or whether writes and updates are permitted (subject to the required ownership rights, of course).

Close the table: required to match the open operation (housekeeping operation).

List available tables:  this should display the names and basic properties of the tables available locally, both system-provided and personally generated.

Drop (or maybe undo): this drops the currently selected dataset or undoes the operation of the preceding processing stage (housekeeping function).

Show the datasets in the stack of subsets from earlier operations (housekeeping function).


Describe: this displays the metadata of the current table where provided (FITS, VOTable, etc.).  It obviously is not possible to do this with tables in a DBMS where there are no practically no metadata except the raw list of column names and data types.

Change  the name or units of a column (subject to access restrictions).

Import external table: formats to be supported (if these are not the internal internal formats) should include: FITS, VOTable, TSV, fixed-format text tables.


Export the current dataset to a file in the specified format; those to be supported should include: FITS, VOTable, TSV, fixed-format text, HTML, Latex tables.

Create a new table interactively: user specifies the column names and data types, and then may enter data, as a simple way of generating a new small tabular dataset.

2.3 Basic table operations

Browse:  this allows the user to scroll through the current dataset in row order and, if the table is too wide to fit on the screen, allows the selection of columns.  A spreadsheet-type display is desirable, with the ability to jump to specified row postions in the table, and also start and end of it.  It is also important to give the user control over the display formats of numbers: the user may want floating-point numbers displayed in a fixed-point style with some sensible number of digits and not just in exponential form (as in so many DBMS); integers representing bit-maps should be displayable in binary or hexadecimal format; angular quantities such as RA and DEC must be displayable in sexagesimal form (hours:minutes:seconds), and so on.

Select: this is essentially the SELECT command from SQL, which allows a subset to be chosen based on the properties of individual columns.

There is a need, I think, both for a simple forms-based approach (select range or value) but this gets difficult if individual elements need to be combined with AND and OR operations, for which it is desirable to allow a more complex expression-based selection (as in SQL, Xquery, etc.). The new subset should become the default dataset for subsequent operations. Obviously indexes have to be used to speed up select operations, if available.

Sort: this sorts the current dataset in ascending (optionally descending) order, generating a new dataset of the same length (except that the rows with null values in the sort column will vanish).

Find Extrema: in many cases the user is only interested in seeing or selecting the top N (or bottom N) rows in a table.  where N is much smaller than the total number of rows, there are much faster ways of finding the extrema than by sorting the whole table, so this operation should be available separately.

Project (compute) new columns as

a function of values in existing columns (and/or metadata values).  Some standard functions need to be provided for common coordinate manipulations, e.g. precessicang (RA,DEC) to another epoch, or computing galactic longitude and latitude.

Statistics:  compute simple statistics on one or more columns of the current dataset (e.g. count of non-null values, min, max, mean, standard-deviation, median)

Regression: compute linear regression coefficient on a pair of columns.

Group values in an integer/string column over common values (as in the SQL GROUP BY operation).


Histogram: produce a histogram of values in a single column, either as another tabular dataset or graphically on the screen.  Range and bin-size should be selectable, with sensible defaults.  It is probably desirable to support 2-d or 3-d binning to produce a data grid or cube, with appropriate visualisation.

Find first or next occurrence of a given value (or maybe a value in a given range) in a specified column in the current dataset: clearly an SQL-style SELECT will find all such values, but sometimes it is desirable to merely scroll to the part of the table where such a value is located, rather than generate a new subset.


Delete (or mark as deleted) a specified subset of rows (subject to access restrictions) producing a new dataset; this allows an undelete to be provided via the drop or undo command.

Create an index for a specified column or columns, so as to speed-up subsequent selection operations (as in SQL CREATE INDEX).  It may be necessary to support both B-tree and R-tree or other spatial indices.


Update values in an existing column by reference to a formula (e.g. re-compute a hardness ratio as the ratio of two flux columns).  This is subject to write access to the table.

Equi-join: join the table with another named table on equality of values in corresponding or named columns - this is an equijoin as supported by SQL systems. on columns of integers or character strings.  Options shold include an outer join.

Fuzzy join - as above but for cross-matching two catalogues based on the overlap of error-circles.  This may need one or more spatial indices to have been created on their RA/DEC coordinates. Options should include at least a left outer join (to identifiy unmatched sources), and a self-join (to find sources close together).

Append new rows to a table (to which the user has write-access) by entering data interactively, or by concatenating two existing tables with compatible structures.

Split a table into two tables according to whether the values in some  column are above or below a specified threshold.  This could be done by successive basic operations (select, save, undo, re-select, save again) but this is awkward.


Plot values of two (or more) columns: the facilities should include basic X-Y graphics and overlaying of points on existing images (given suitable coordinate frame metadata).  The range of graphical output that it might be desirable to produce is rather large, and merges eventually into the field of data visualisation.  While it is essential to integrate basic graphical facilities with the table-handling operations, I think one has to accept that the more advanced visualisation facilities will have to be provided by exporting the data and using a specialised package.

Edit a table (to which the user has write access) interactively, e.g. to correct individual invalid values.

Sample every Nth row to generate a new dataset which is much smaller but has similar overall properties to the parent: it may be desirable to allow the selection of  a random subset.

Resample: rebin a given column with a new bin-size using nearest-neighbour or an interpolation method.

Differ: find differences between one or more columns of two tables.


Time-series functions: it is likely that some additional time-series functionality will be needed, such as computing running means, finding peaks, taking Fourier transforms, and so on, but the exact set of operations that can be included in a general data mining facility needs further discussion, I feel.

3. Existing Software

3.1 On-line Services

Vizier (one of the CDS facilities with several mirrors) has the world's largest collection of tabular datasets, mainly source catalogues.  Its on-line facilities include the selection of subsets from one or more tables, with output in FITS, TSV, or XML format.

 It is possible to upload a list of constraints in a text file, or a list of sources to be cross-matched.  There is an output limit of 9999 rows.

W3browse is an on-line facility from HEASARC (NASA/GSFC) which allows two catalogues from their pre-defined set to be cross-matched on spatial coordinates.  It is also possible to perform an anti-correlation to find sources which do not have counterparts in the second catalogue, and to correlate on time of observation.

SkyServer is a tool for cross-matching the Sloan Digital Sky Survey data releases with the FIRST and 2MASS catalogues.

It is designed to work on relatively small areas of sky, and has limits on the volume of output for users outside the project.
The on-line catalogue access services have a common limitation that they are really only designed for accessing data from a small area of sky. Investigations which require a scan of the whole sky for, say a given class of source, are simply not feasible using any of these services.

3.2 Software Packages


I have attempted to analyse the table-handing functionality in what I think are the four most widely-used astronomical software collections:

  • FTOOLS is a collection of utilities for handling FITS images and tables, part of HEASOFT from NASA/GSFC.  The fv program is controlled by a GUI, all others are command-line tasks.  
  • CURSA is a package written by Clive Davenhall for handling astronomical catalogues stored in FITS tables (and other formats); it is part of the Starlink Software Collection.  The xcatview task is run from a GUI, the others are command-line tasks, using the Starlink parameter system.
  • STSDAS-TABLES was written at STScI as an add-on to the well-known IRAF package from NOAO.  STSDAS TABLES package is a set of  command-line programs which operate on tables stored in their own internal format, but which can be in either row or column-major order.  There are import/export utilities, but only for text formats, and Tex/Latex tables, not FITS.
  • MIDAS from ESO has many tasks for table manipulation: MIDAS has its own internal table format (column-major) but can import and export FITS tables. All are command-line tasks.
The following table attempts to list, for each operation listed in section 2 above, whether the function is supported by these four packages, and by a DBMS suppporting standard SQL.  Where a cell is blank it means that as far as I can find out the corresponding function is not provided (at least directly). In some cases there are functional overlaps or partial functionality which are hard to display in a simple tablular form.



FTOOLS
CURSA
STSDAS

MIDAS DBMS with SQL
browse

fv
xcatview



select
fselect

catselect
tselect
select/table
select
sort
fsort, fv

catsort
tsort
sort/table
select + order by
extrema




select .. top n
project

xcatview
tproject

compute/table
select
statistics

fstatistic
xcatview
tstat
statistics/table

select + avg()
regression


tlinear
compute/regression

group




select + group by
histogram
fhisto
catgrid
thistogram
compute/histogram

find
fv




delete
fdelrow, fv



delete
create index
findex




create index

update
fcalc



select
equi-join


tjoin


select + join
fuzzy join


catpair

join/table

(select + join)
append
fmerge



merge/table
insert

split

catselect




plot
fplot
catchart
gtedit

plot/table

edit

fv

gtedit
edit/table


sample

catselect



resample


tsample
rebin/table

differ
fdiff (?)

tdiffer




It is important to realise that none of these four packages is designed to handle anything more than fairly modest-sized tables: in particular none have any general concept of indexed access.  The catpair operation of CURSA relies on the second table being sorted in ascending order of one positional coordinate; the JOIN/TABLE operation of MIDAS generates an R-tree internally but does not preserve it.  Although there is a FTOOL to create an index, none of the other tasks is capable of making use of it.

There are also many statistical and visualisation packages available, which offer some of these functions.  As far as I can find out, every single one of them is designed to operate on datasets which can be stored in the central memory of a computer: perhaps this is not surprising, since to a statistician a collection of 100,000 points is a very large dataset indeed.  Although memory prices are still falling, and workstations with a gigabyte of memory are no longer exceptional, we are still some years away from being able to hold a table of a billion rows in memory.  This means that memory-based packages are currently unsuitable.

4. Implementation Notes

Both FTOOLS and CURSA are built using the cFITSIO library, which has network access built in: both of these packages can therefore access FITS files on remote machines using either FTP or HTTP protocols.  The way this works, however, is that the whole of the remote dataset is copied to local scratch storage.   The more advanced data access methods in OGSA-DAI may allow a finer-grained access to remote data.

None of the existing software packages looks attractive as a basis for the desired data mining facility, as they are really not designed to access the large tables as are now becoming commonplace.  To the FITS tasks one would need to add indexing and a better GUI.

Tabular Data Infrastructure

If a DBMS were used as the basis of the data mining facility, the main problem areas would be:

  • Lack of support for metadata such as physical dimensions of columns, and table-wide metadata.
  • Lack of any graphical facilities at all, or of efficient export formats for connecting to external packages.
  • Poor support for interactive facilities.  Some DBMS have the concept of a VIEW which is a subset, but it is not clear whether this is flexible enough for what is proposed.
  • An additional software layer would be needed to provide a persistent connection for the semi-interactive and batch facilities.
  • DBMS do not appear to be very efficient at sequential data access: our tests suggest that such operations are cpu-bound rather than I/O bound as one would expect.
  • Only a few DBMS support parallel hardware such as Beowulf clusters (e.g. DB2, SQL Server).
If the data mining facility were to be built upon FITS binary tables, the main problem areas would be:

  • Indexing would have to be added (not difficult, prototypes exist).
  • FITS tables of more than 231 bytes (2 Gbyte) are possible, but only with a file system supporting large files, and a C compiler supporting 64-bit integers.  Use of FITS tables with more than 231rows appears to be consistent with the Standard, but is not supported by the current cFITSIO library.  The largest current catalogues are smaller than this, but only with a factor of 2 in hand.
  • There is no built-in support for parallel hardware; it would be necessary to split up large tables into chunks distributed over the set of nodes.
User Interface

Whatever the underlying data access mechanism, a graphical user interface would be needed.  The best basis for this would probably the Java Swing package, which has good support for spreadsheet-style tabular displays.  For the more complex operations some sort of query language is needed: this has been the subject of a number of discussions, and something based on SQL SELECTs or Xquery syntax is an obvious choice, but many details remain to be resolved.

Client Side or Server Side?
It is not clear whether the data mining facility could best be provided as a stand-along client program, or as server-side software. I have deliberately written the functional requirements in section 2 to be independent of this decision.

Providing the data mining facility on the server which also holds large standard catalogues would obviously make them more accessible, but would make it more difficult for users to use the package with their own tables, as these would have to be explicitly uploaded.  The MySpace concept may help, but I forsee problems in persuading users to leave their valuable datasets for long periods on some unknown server elsewhere "on the grid".

Providing a software package which runs on the user's own machine would simplify access to the smaller datasets, and improve response time for simple interactive operations, but this would generally be inefficient for access to large remote catalogues.

-- ClivePage - 20 Jan 2003


The comments originally appended here from Clive Davehall, Patricio Ortiz, and Bob Mann have now been transferred to the Astrogrid Forum site where it will be easier to continue the discussion. I hope nothing got lost in the transfer.

Topic revision: r6 - 2003-01-20 - 14:39:49 - ClivePage
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback