AstroGrid Data Exploration Facility
This page is for discussing ideas related to inclusion of a data exploration facility in the
AstroGrid-II proposal.
Appended is a first stab at a partial draft of that section of the proposal. The ideas behind it came from discussions with
ClivePage and with Ken Brodlie from Leeds Univ CS, who is very interested in the visualization issues, but I am to blame for the deficiencies of this draft, which is big on vision but notably unsullied by specifics.
--
BobMann - 05 May 2003
AstroGrid Data Exploration Facility
Introduction
The Virtual Observatory should deliver to the desktop of every astronomer the
capability to query, and obtain the data she wants from, all the world's
significant data sources. The scientific success of the VO depends on the
ease with which science can be extracted from those data - most importantly,
through the new kinds of science that the VO makes practically possible for
the first time. Foremost amongst this new science will be the exploration of
the multi-dimensional data spaces created by the federation of distributed
databases, and the
AstroGrid Data Exploration Facility (ADEF) is designed to
enable this type of research.
An idealised ADEF session
An astronomer has a hunch about connections between the
properties of brightest cluster galaxies (BCGs) and those of their
host clusters. She queries the VO to construct a sample of clusters
observed with sufficient detail in the optical and the X-ray, and for
which there exists good quality photometry for the BCG over a wide
range of passbands. All in all, this might yield a set of 400 attributes
for 10,000 BCG/cluster pairs. This is far too much data to analyse in
detail, so the astronomer runs a statistics package which seeks the twenty
attributes with the highest information content, and then generates a grid
of scatter plots for pairs of them, arranged in order by the strength of the
correlation between them. This reveals that there are very significant
correlations between a set of six attributes, so the astronomer launches
another visualization tool, which allows navigation through 3D projections of
a higher dimensional data space. The full 10,000-object sample may be
too large for this, so the astronomer selects a subsample of 200
objects, chosen by a statistical algorithm to be representative of the
distribution of the full sample across the six attributes of interest.
Experimenting with this tool suggests that there are, in fact, three
clusters of points in this data space, which the astronomer thinks
might correspond to distinct populations. The visualization tool
allows the astronomer to "paint" the members of these three clusters
in different colours, marking them as three separate subsets of the
data, in a manner that is recognised by other tools in the ADEF
system. This classification scheme is then applied to the full set of
10,000 records, and statistical tests run to assess its
significance. This is found to be strong, so the astronomer saves the
data from this session, and moves on to figuring out the astrophysical
processes that might lie behind this division into three classes; the
data storage being such that, at a later date, she can reload the data
into the ADEF and overplot a set of predictions she has since
generated from new models for the evolution of the cluster and its
BCG, which may, in turn, have been produced using a numerical
simulation service within the VO.
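The "grid of scatter plots ... arranged in order by the strength of the correlation" step in the session above can be sketched as follows. This is only an illustration under stated assumptions: Pearson's r stands in for "strength of correlation", the earlier step of choosing the twenty attributes with the highest information content is not shown, and the function names and the toy attribute names (bcg_mag, xray_lum, richness) are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant attribute shows no correlation
    return sxy / math.sqrt(sxx * syy)


def rank_pairs(table):
    """Return (name_a, name_b, r) for every attribute pair, strongest first.

    `table` maps an attribute name to a list of values, one per object.
    """
    pairs = [(a, b, pearson(table[a], table[b]))
             for a, b in combinations(sorted(table), 2)]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical toy sample: three attributes for five BCG/cluster pairs.
sample = {
    "bcg_mag": [1.0, 2.0, 3.0, 4.0, 5.0],
    "xray_lum": [2.1, 3.9, 6.2, 8.0, 9.8],
    "richness": [5.0, 1.0, 4.0, 2.0, 3.0],
}
ranked = rank_pairs(sample)
for name_a, name_b, r in ranked:
    print(f"{name_a} vs {name_b}: r = {r:+.3f}")
```

The ordered list gives the order in which the scatter plot panels would be laid out; for 20 attributes there are 190 pairs, so the ranking is cheap compared with fetching the data.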
Functional Analysis of the idealised ADEF session
It is clear that there are several stages within the process described above:
1. Enquiry Formulation:
The astronomer works out what she wants to do and formulates the query that
will select the required data from the VO. This may require several iterations,
as the astronomer refines her query in the light of what the VO registry
reports about which relevant data are available and what quality issues are
associated with them. This functionality will be delivered by the "first
generation" VO, such as that implemented by
AstroGrid, but will clearly be refined
over time.
2. Query Processing:
Once a final query has been defined, the VO system will select the best
execution plan, possibly involving selection between different mirrors of some
databases. Again, this will be delivered by
AstroGrid, and is likely to
build on the distributed query processing capabilities of OGSA-DAI.
3. Data Integration:
The process by which records from different databases are combined to
yield, in this example, a large collection of attributes of BCGs and
their host clusters, stored in the correct format for the data
exploration session, is something that is likely to evolve with the
VO.
AstroGrid will certainly implement the matching of records by
proximity, and may offer use of some more complex association techniques,
but it is likely there will have to be work on this in
AstroGrid-II.
The goal here is to minimise the volume of data that has to be transported
between sites, but it will take significant effort to integrate
sophisticated association techniques into the VO query specification
process sufficiently well that there is no need for filtering of data
extracted from databases: research into this topic is underway at
Edinburgh (Emma Taylor's PhD project, funded from the PPARC e-science
studentship scheme), while a web service implementation of simple matching
by proximity, called SkyQuery, has been produced by the Johns Hopkins group.
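To make "matching of records by proximity" concrete, here is a minimal sketch that pairs each source in one catalogue with its nearest neighbour in another, provided the two lie within a chosen radius. It is a brute-force illustration, not a description of how AstroGrid or SkyQuery perform the match; the function names, the (id, ra, dec) tuple layout and the toy coordinates are all assumptions.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def match_by_proximity(cat_a, cat_b, radius_arcsec):
    """Map each id in cat_a to the nearest cat_b id within the radius, or None.

    Each catalogue is a list of (id, ra_deg, dec_deg) tuples.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches


# Hypothetical toy catalogues.
cat_a = [("A1", 150.0, 2.0), ("A2", 10.0, -5.0)]
cat_b = [("B1", 150.0002, 2.0001), ("B2", 200.0, 30.0)]
matches = match_by_proximity(cat_a, cat_b, radius_arcsec=2.0)
print(matches)
```

The double loop costs one separation calculation per pair of sources, which is why a production service would need spatial indexing, and why performing the match close to the data helps to limit the volume transported between sites.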
It is likely that default associations between popular pairs of databases
(e.g.
UKIDSS and SDSS) should be stored, rather than computed on the fly, but
it is not clear how or where that information should be recorded. The obvious
solution is within a database - either by means of attributes within the
tables of the databases, or in a separate "associations" database - but there
may be merit in recording this information in a manner that is more easily
supplemented or challenged by third parties. Within the biological community,
for example, it is commonplace for interested scientists to provide annotations
to published databases and these can be incorporated within the database
query mechanism in such systems as the Distributed Annotation Server (DAS); one
might equally well imagine a Distributed Association Server mediating the
identification of multi-wavelength IDs in the VO.
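One of the options above, a separate "associations" database that third parties can supplement or challenge, might look something like the sketch below. The table layout, the column names, the identifiers and the "asserted_by" labels are all invented for illustration; nothing here is an agreed VO schema.

```python
import sqlite3

# In-memory database standing in for a separate "associations" store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE associations (
        catalogue_a TEXT NOT NULL,
        id_a        TEXT NOT NULL,
        catalogue_b TEXT NOT NULL,
        id_b        TEXT NOT NULL,
        asserted_by TEXT NOT NULL,
        PRIMARY KEY (catalogue_a, id_a, catalogue_b, id_b, asserted_by)
    )
""")

# Hypothetical rows: two default associations and one third-party challenge.
rows = [
    ("UKIDSS", "U-001", "SDSS", "S-417", "default"),
    ("UKIDSS", "U-001", "SDSS", "S-418", "annotator-1"),
    ("UKIDSS", "U-002", "SDSS", "S-903", "default"),
]
conn.executemany("INSERT INTO associations VALUES (?, ?, ?, ?, ?)", rows)


def counterparts(conn, catalogue_a, id_a, catalogue_b):
    """Return every asserted counterpart as (id_b, asserted_by) pairs."""
    cur = conn.execute(
        "SELECT id_b, asserted_by FROM associations "
        "WHERE catalogue_a = ? AND id_a = ? AND catalogue_b = ? "
        "ORDER BY asserted_by",
        (catalogue_a, id_a, catalogue_b),
    )
    return cur.fetchall()


found = counterparts(conn, "UKIDSS", "U-001", "SDSS")
print(found)
```

Keeping the source of each assertion in the key lets a default association and a competing one coexist, leaving the choice between them to the query or to the user.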
There are also open issues regarding the format in which to store the data to
be explored: should they be loaded into a DBMS, should they be stored on disk
in a compact binary format, or should they be kept in (possibly compressed)
XML, as that might aid their being read into the data mining and visualization
tools within the ADEF?
4. Data Exploration:
The core activity is the interactive exploration of the data, making use
of a number of data mining and visualization algorithms from an interoperable
toolbox. Some parts of this functionality are available within current
systems. For example, the Mirage data exploration package developed at
Lucent Technologies includes the ability to select ("paint") subsets of data
in a window running one task and have them displayed in different colours
in windows running other tasks.
[Ken to add something about the state-of-the-art in the relevant visualization
topics.]
Several significant issues will need addressing, though, and are not covered by
the plans of
AstroGrid. For example, many existing data mining algorithms will
not be scalable to the VO, either in terms of the number of dimensions they can
study or the volume of data, since many are designed to work in main memory.
It will be necessary to identify which these are, and to recast them in
suitable form: this is likely to require a combination of work on the design
of the algorithms themselves and on the software engineering to code them
efficiently.
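As a small illustration of what recasting an in-memory algorithm "in suitable form" can mean, the sketch below computes a mean and variance in a single pass over a stream of values, so the data never need to fit in main memory. It uses Welford's well-known update and is only an example of the general idea, not one of the algorithms the ADEF would necessarily include; the class and function names are hypothetical.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def stream_stats(values):
    """Consume any iterable (e.g. rows read from disk) one value at a time."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats


# A generator stands in for a data set too large to hold in memory.
stats = stream_stats(float(i) for i in range(1, 6))
print(stats.n, stats.mean, stats.variance)
```

Memory use is constant however many records are streamed, at the cost of allowing only one look at each value; many clustering and classification methods have no such simple streaming form, which is where the algorithmic work mentioned above comes in.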
Another major issue is how to compose different tools within a web or
Grid service framework, and there is already work on this underway within the
e-science community. For example, the
DiscoveryNet project is producing a
prototype architecture for composing data mining services within a Grid
environment, and, at a more general level, the vendor-led Data Mining Group is
driving the development of Predictive Model Markup Language (PMML), an
XML language for the description of statistical and descriptive models. Many
astronomers may feel most at ease exploring data using packages that they
already know, such as Interactive Data Language (IDL), or Starlink routines,
rather than learning new tools from the data mining community, so it will be
important to ensure that the service framework within which the ADEF operates
can wrap familiar packages and readily allow the integration of new ones,
developed by users.
Proposed ADEF work within AstroGrid-II
The original VO vision requires the existence of data mining facilities well
in excess of what can be provided by the "first generation" VO projects,
such as
AstroGrid. Elements of what is required do already exist within
the e-science and computer science communities - albeit often at a research
prototype, not production, level - and it is important for the VO community
to take full advantage of this previous work. Moreover, identical challenges
face many other data-rich sciences, as evinced by the report from the
NeSC
Workshop on "Scientific Data Mining, Integration and Visualization", and
some of these issues may well be best addressed at a generic level, to prevent
the reinvention of the wheel within each discipline. We shall, therefore,
be seeking additional funding for this activity as part of a multi-disciplinary
study of data mining and visualization services for e-science, both to ensure
that astronomy does not develop a system incompatible with wider standards
and to leverage additional funding for the investigation of the data
mining and visualization requirements of the VO, for these are an ideal
exemplar of the more general situation in e-science.
Here, however, we outline our plans for ADEF-related work within the
AstroGrid-II project. Detailed scoping of such work is difficult, because
it is hard to predict where the current VO and e-science projects will
have reached by the start of 2005, when this work is supposed to commence.
[Need to add more on hardware as well as software side of things.]
Preliminary input from Ken Brodlie on visualization aspects
There are a number of approaches to the visualization of multivariate data. The problem is hard because we need to find a way of reducing the high dimensionality to a form that can be presented on a 2D display device (or 3D if stereo is used). Perhaps the two most successful are scatter plots and parallel co-ordinates.
In the case of scatter plots, it is usual to construct a grid of 2D scatter plots, in which each attribute is paired with every other in an attempt to recognise correlations and clusters in a pairwise manner. The complexity of the presentation increases as the number of attributes increases.
In the case of parallel co-ordinates, a single representation of the dataset is constructed, with a set of parallel axes, one for each attribute. Each record appears as a ‘polyline’, connecting the corresponding attribute values on each axis. Clusters appear as a set of nearly identical polylines, and correlations between attributes can be observed (if their axes are adjacent at least). A difficulty is the density of the representation as the number of records increases.
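The construction of the polylines can be sketched as follows: each attribute is rescaled to the range 0 to 1 along its own axis, and each record becomes a list of (axis index, height) vertices. This is a minimal illustration only; the function name, the record layout and the toy attribute names are assumptions, and no plotting is attempted.

```python
def polylines(records, attributes):
    """Return one polyline per record as a list of (axis_index, height) vertices.

    Each attribute is scaled to [0, 1] using its minimum and maximum over all
    records; a constant attribute is drawn at mid-height.
    """
    lo = {a: min(r[a] for r in records) for a in attributes}
    hi = {a: max(r[a] for r in records) for a in attributes}
    lines = []
    for r in records:
        line = []
        for i, a in enumerate(attributes):
            span = hi[a] - lo[a]
            height = (r[a] - lo[a]) / span if span else 0.5
            line.append((i, height))
        lines.append(line)
    return lines


# Hypothetical toy records with two attributes.
records = [
    {"mag": 10.0, "colour": 1.0},
    {"mag": 20.0, "colour": 3.0},
    {"mag": 15.0, "colour": 2.0},
]
lines = polylines(records, ["mag", "colour"])
print(lines)
```

The order of the attribute list fixes which axes are adjacent, and hence which correlations are easy to see, so reordering the list is itself an exploration step.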
Recent research has sought to add an initial pre-processing step, in which the number of attributes and records is reduced (by cluster analysis, for example), so that the resulting visualization is less complex. The work of Ward is especially important, and is made publicly available through the XmdvTool software, available for both Windows and UNIX platforms.
The challenges for this project will include the following:
- The integration of computer and human processing - how much analysis is done by the machine before the human takes over to visualize the data? Indeed the process has to be an iterative one, whereby the visualization is the prompt to suggest further machine analysis (in this context it is very similar to computational steering of large simulations in CFD, for example).
- The navigation through high dimensional spaces - the two techniques described above aim to present all the selected data in one visualization (albeit a grid of plots in the case of scatter plots). An alternative is to guide the user through a sequence of low-dimensional projections of the data - this guided tour being driven by data analysis so that only ‘interesting’ subsets are visited.
- Collaborative analysis - modern visualization software such as IRIS Explorer allows geographically separated researchers to work together on a visualization. How can we apply this technology to data analysis in AstroGrid?
KWB
6-May-03