r2 - 06 May 2003 - 12:52:43 - BobMann

AstroGrid Data Exploration Facility

This page is for discussing ideas related to inclusion of a data exploration facility in the AstroGrid-II proposal.

Appended is a first stab at a partial draft of that section of the proposal. The ideas behind this came from discussions with ClivePage, and Ken Brodlie from Leeds Univ CS is very interested in the visualization issues, but I am to blame for the deficiencies of this draft, which is big on vision but notably unsullied by specifics.

-- BobMann - 05 May 2003


AstroGrid Data Exploration Facility

Introduction

The Virtual Observatory should deliver to the desktop of every astronomer the capability to query, and obtain the data she wants from, all the world's significant data sources. The scientific success of the VO depends on the ease with which science can be extracted from those data - most importantly, through the new kinds of science that the VO makes practically possible for the first time. Foremost amongst this new science will be the exploration of the multi-dimensional data spaces created by the federation of distributed databases, and the AstroGrid Data Exploration Facility (ADEF) is designed to enable this type of research.

An idealised ADEF session

An astronomer has a hunch about connections between the properties of brightest cluster galaxies (BCGs) and those of their host clusters. She queries the VO to construct a sample of clusters observed with sufficient detail in the optical and the X-ray, and for which there exists good quality photometry for the BCG over a wide range of passbands. All in all, this might yield a set of 400 attributes for 10,000 BCG/cluster pairs. This is far too much data to analyse in detail, so the astronomer runs a statistics package which seeks the twenty attributes with the highest information content, and then generates a grid of scatter plots for pairs of them, arranged in order by the strength of the correlation between them. This reveals that there are very significant correlations between a set of six attributes, so the astronomer launches another visualization tool, which allows navigation through 3D projections of a higher dimensional data space. The full 10,000-object sample may be too large for this, so the astronomer selects a subsample of 200 objects, chosen by a statistical algorithm to be representative of the distribution of the full sample across the six attributes of interest. Experimenting with this tool suggests that there are, in fact, three clusters of points in this data space, which the astronomer thinks might correspond to distinct populations. The visualization tool allows the astronomer to "paint" the members of these three clusters in different colours, marking them as three separate subsets of the data, in a manner that is recognised by other tools in the ADEF system. This classification scheme is then applied to the full set of 10,000 records, and statistical tests run to assess its significance. 
The significance is found to be strong, so the astronomer saves the data from this session and moves on to figuring out the astrophysical processes that might lie behind this division into three classes. The data are stored in such a way that, at a later date, she can reload them into the ADEF and overplot a set of predictions she has since generated from new models for the evolution of the cluster and its BCG, which may, in turn, have been produced using a numerical simulation service within the VO.
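
The pairwise ranking step in this session could be sketched roughly as follows. This is only an illustration: it assumes that "strength of the correlation" means the absolute Pearson coefficient, which the draft does not specify, and the attribute names and values in `sample` are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute carries no correlation information
    return cov / sqrt(var_x * var_y)


def rank_attribute_pairs(table):
    """Return (|r|, name_a, name_b) for every attribute pair, strongest first.

    `table` maps an attribute name to its list of values, one per object.
    """
    ranked = [
        (abs(pearson(table[a], table[b])), a, b)
        for a, b in combinations(sorted(table), 2)
    ]
    ranked.sort(reverse=True)
    return ranked


# Hypothetical toy sample: attribute names and values are invented.
sample = {
    "bcg_mag": [1.0, 2.0, 3.0, 4.0, 5.0],
    "xray_lum": [2.1, 3.9, 6.2, 8.0, 9.8],
    "redshift": [0.3, 0.1, 0.4, 0.1, 0.5],
}
for strength, a, b in rank_attribute_pairs(sample):
    print(f"{a} vs {b}: |r| = {strength:.2f}")
```

A real tool would presumably order the scatter plot grid by this ranking; a rank-based or non-linear measure might serve better than Pearson for some attributes.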

Functional Analysis of the idealised ADEF session

It is clear that there are several stages within the process described above:

1. Enquiry Formulation:

The astronomer works out what she wants to do and formulates the query that will select the required data from the VO. This may require several iterations, as the astronomer refines her query in the light of results from the VO registry showing which relevant data are available and what quality issues are associated with them. This functionality will be delivered by the "first generation" VO, such as implemented by AstroGrid, but will clearly be refined over time.

2. Query Processing:

Once a final query has been defined, the VO system will select the best execution plan, possibly involving selection between different mirrors of some databases. Again, this will be delivered by AstroGrid, and is likely to build on the distributed query processing capabilities of OGSA-DAI.

3. Data Integration:

The process by which records from different databases are combined to yield, in this example, a large collection of attributes of BCGs and their host clusters, stored in the correct format for the data exploration session, is something that is likely to evolve with the VO. AstroGrid will certainly implement the matching of records by proximity, and may offer use of some more complex association techniques, but it is likely there will have to be work on this in AstroGrid-II.
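
As a rough illustration of matching records by proximity, the sketch below pairs each record in one catalogue with its nearest neighbour in another, within a search radius. It is not AstroGrid's implementation: the function names, the tuple layout and the brute-force search are all assumptions made for the example.

```python
from math import asin, cos, degrees, radians, sin, sqrt


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = sin((dec2 - dec1) / 2) ** 2 + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(min(1.0, sqrt(h))))


def match_by_proximity(catalogue_a, catalogue_b, radius_arcsec):
    """Pair each record in A with its nearest neighbour in B within the radius.

    Each catalogue is a list of (identifier, ra_deg, dec_deg) tuples.
    Brute force, O(len(A) * len(B)): fine for a sketch, not for survey-sized tables.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in catalogue_a:
        best = None
        for id_b, ra_b, dec_b in catalogue_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches


# Hypothetical positions, invented for the example.
optical = [("opt-1", 150.0000, 2.0000), ("opt-2", 150.1000, 2.1000)]
xray = [("x-1", 150.0003, 2.0002), ("x-9", 151.0000, 3.0000)]
for id_a, id_b, sep_arcsec in match_by_proximity(optical, xray, radius_arcsec=2.0):
    print(f"{id_a} <-> {id_b} at {sep_arcsec:.2f} arcsec")
```

The more complex association techniques mentioned above would replace the simple nearest-neighbour test, for example by taking positional errors or source densities into account.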

The goal here is to minimise the volume of data that has to be transported between sites, but it will take significant effort to integrate sophisticated association techniques into the VO query specification process sufficiently well that there is no need for filtering of data extracted from databases. Research into this topic is underway at Edinburgh (Emma Taylor PhD project, funded from PPARC e-science studentship scheme), while a web service implementation of simple matching by proximity, called SkyQuery, has been produced by the Johns Hopkins group. It is likely that default associations between popular pairs of databases (e.g. UKIDSS and SDSS) should be stored, rather than computed on the fly, but it is not clear how or where that information should be recorded. The obvious solution is within a database - either by means of attributes within the tables of the databases, or in a separate "associations" database - but there may be merit in recording this information in a manner that is more easily supplemented or challenged by third parties. Within the biological community, for example, it is commonplace for interested scientists to provide annotations to published databases and these can be incorporated within the database query mechanism in such systems as the Distributed Annotation Server (DAS); one might equally well imagine a Distributed Association Server mediating the identification of multi-wavelength IDs in the VO.
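
One way to picture the separate "associations" database option is a single table in which every row records who asserted a given match, so that third parties can add or dispute entries without touching the survey tables. The schema, identifiers and the `asserted_by` column below are hypothetical, and an in-memory SQLite database stands in for whatever DBMS would actually be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE associations (
        survey_a    TEXT NOT NULL,
        id_a        TEXT NOT NULL,
        survey_b    TEXT NOT NULL,
        id_b        TEXT NOT NULL,
        asserted_by TEXT NOT NULL,   -- who claims these are the same object
        note        TEXT
    )
    """
)
# Hypothetical rows: object identifiers and sources of assertion are invented.
rows = [
    ("UKIDSS", "U-0001", "SDSS", "S-0042", "default-proximity", "nearest neighbour"),
    ("UKIDSS", "U-0001", "SDSS", "S-0043", "third-party", "challenges the default match"),
    ("UKIDSS", "U-0002", "SDSS", "S-0100", "default-proximity", "nearest neighbour"),
]
conn.executemany("INSERT INTO associations VALUES (?, ?, ?, ?, ?, ?)", rows)


def counterparts(conn, survey_a, id_a, survey_b):
    """List every asserted counterpart of one object, with who asserted it."""
    cursor = conn.execute(
        "SELECT id_b, asserted_by FROM associations "
        "WHERE survey_a = ? AND id_a = ? AND survey_b = ? ORDER BY asserted_by",
        (survey_a, id_a, survey_b),
    )
    return cursor.fetchall()


print(counterparts(conn, "UKIDSS", "U-0001", "SDSS"))
```

Keeping competing assertions side by side, rather than overwriting the default, is what would let a query choose between the stored default and a third-party correction.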

There are also open issues regarding the format in which to store the data to be explored: should they be loaded into a DBMS, should they be stored on disk in a compact binary format, or should they be kept in (possibly compressed) XML, as that might aid their being read into the data mining and visualization tools within the ADEF?
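
The trade-off between these storage options can be made concrete with a small size comparison. The sketch below is only indicative: the three-column record layout, the element and attribute names, and the use of zlib for compression are all assumptions, and the values are invented.

```python
import struct
import xml.etree.ElementTree as ET
import zlib

# Hypothetical records: (ra_deg, dec_deg, magnitude); values are invented.
records = [(150.0 + 0.001 * i, 2.0 + 0.002 * i, 18.0 + 0.01 * i) for i in range(1000)]

# Compact binary: three little-endian doubles per record, no field names.
binary = b"".join(struct.pack("<3d", *rec) for rec in records)

# XML: self-describing, one element per record with named attributes.
root = ET.Element("table")
for ra, dec, mag in records:
    ET.SubElement(root, "row", ra=repr(ra), dec=repr(dec), mag=repr(mag))
xml_bytes = ET.tostring(root, encoding="utf-8")

print("binary bytes:        ", len(binary))
print("XML bytes:           ", len(xml_bytes))
print("compressed XML bytes:", len(zlib.compress(xml_bytes)))

# The binary form can only be read back if the layout is known out of band.
first = struct.unpack_from("<3d", binary, 0)
```

The binary form is smaller and faster to parse, whereas the XML form carries its own column names, which is the property that might help tools in the ADEF read it.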

4. Data Exploration:

The core activity is the interactive exploration of the data, making use of a number of data mining and visualization algorithms from an interoperable toolbox. Some parts of this functionality are available within current systems. For example, the Mirage data exploration package developed at Lucent Technologies includes the ability to select ("paint") subsets of data in a window running one task and have them displayed in different colours in windows running other tasks.
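
The "painting" behaviour described here amounts to several tools sharing one selection state. A minimal sketch of that idea, assuming a simple observer pattern, is given below; the class names are hypothetical and this is not how Mirage itself is implemented.

```python
class SharedSelection:
    """Holds the colour assigned to each painted record and notifies listeners."""

    def __init__(self):
        self._colour_of = {}      # record id -> colour name
        self._listeners = []      # callables invoked after every change

    def subscribe(self, listener):
        self._listeners.append(listener)

    def paint(self, record_ids, colour):
        for record_id in record_ids:
            self._colour_of[record_id] = colour
        for listener in self._listeners:
            listener(dict(self._colour_of))  # pass a copy so tools cannot mutate state

    def subset(self, colour):
        return sorted(rid for rid, c in self._colour_of.items() if c == colour)


class ToolWindow:
    """Stand-in for one visualization or statistics tool (hypothetical)."""

    def __init__(self, name, selection):
        self.name = name
        self.colours = {}
        selection.subscribe(self.on_selection_changed)

    def on_selection_changed(self, colour_of):
        self.colours = colour_of  # a real tool would redraw here


selection = SharedSelection()
scatter = ToolWindow("scatter grid", selection)
projection = ToolWindow("3D projection", selection)

selection.paint([3, 17, 42], "red")
selection.paint([5, 8], "blue")
print(projection.colours)
print(selection.subset("red"))
```

In a service-based ADEF the same state would have to be exchanged between separate processes rather than objects in one program, which is part of the tool composition problem.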

[Ken to add something about the state-of-the-art in the relevant visualization topics.]

Several significant issues will need addressing, though, and are not covered by the plans of AstroGrid. For example, many existing data mining algorithms will not be scalable to the VO, either in terms of the number of dimensions they can study or the volume of data, since many are designed to work in main memory. It will be necessary to identify which these are, and to recast them in suitable form: this is likely to require a combination of work on the design of the algorithms themselves and on the software engineering to code them efficiently.
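
As a small example of what recasting an algorithm for data that do not fit in main memory can mean, the sketch below computes a mean and variance in a single pass over chunks of data, using Welford's algorithm. It is an illustration of the general idea only; the chunked reader is a stand-in for reading a large table from disk, and the values are invented.

```python
class RunningStats:
    """One-pass mean and variance (Welford's algorithm), using O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined for fewer than two values."""
        if self.count < 2:
            raise ValueError("need at least two values")
        return self._m2 / (self.count - 1)


def stream_in_chunks(values, chunk_size):
    """Stand-in for reading a large table from disk one chunk at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


stats = RunningStats()
for chunk in stream_in_chunks([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], chunk_size=3):
    for value in chunk:
        stats.update(value)
print(stats.count, stats.mean, stats.variance)
```

Simple statistics adapt easily in this way; many clustering and classification algorithms need more substantial redesign, which is the work referred to above.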

Another major issue is how to compose different tools within a web or Grid service framework, and there is already work on this underway within the e-science community. For example, the DiscoveryNet project is producing a prototype architecture for composing data mining services within a Grid environment, and, at a more general level, the vendor-led Data Mining Group is driving the development of Predictive Model Markup Language (PMML), an XML language for the description of statistical and descriptive models. Many astronomers may feel most at ease exploring data using packages that they already know, such as Interactive Data Language (IDL), or Starlink routines, rather than learning new tools from the data mining community, so it will be important to ensure that the service framework within which the ADEF operates can wrap familiar packages and readily allow the integration of new ones, developed by users.

Proposed ADEF work within AstroGrid-II

The original VO vision requires the existence of data mining facilities well in excess of what can be provided by the "first generation" VO projects, such as AstroGrid. Elements of what is required do already exist within the e-science and computer science communities - albeit often at a research prototype, not production, level - and it is important for the VO community to take full advantage of this previous work. Moreover, identical challenges face many other data-rich sciences, as evinced by the report from the NeSC Workshop on "Scientific Data Mining, Integration and Visualization", and some of these issues may well be best addressed at a generic level, to prevent the reinvention of the wheel within each discipline. We shall, therefore, be seeking additional funding for this activity as part of a multi-disciplinary study of data mining and visualization services for e-science, both to ensure that astronomy does not develop a system incompatible with wider standards and to leverage additional funding for the investigation of the data mining and visualization requirements of the VO, which are an ideal exemplar of the more general situation in e-science.

Here, however, we outline our plans for ADEF-related work within the AstroGrid-II project. Detailed scoping of such work is difficult, since it is hard to predict where the current VO and e-science projects will have reached by the start of 2005, when this work is due to commence.

[Need to add more on hardware as well as software side of things.]


Preliminary input from Ken Brodlie on visualization aspects

There are a number of approaches to the visualization of multivariate data. The problem is hard because we need to find a way of reducing the high dimensionality to a form that can be presented on a 2D display device (or 3D if stereo is used). Perhaps the two most successful are scatter plots and parallel co-ordinates.

In the case of scatter plots, it is usual to construct a grid of 2D scatter plots, in which each attribute is paired with every other in an attempt to recognise correlations and clusters in a pairwise manner. The complexity of the presentation increases as the number of attributes increases.
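
To put a number on that growth: n attributes give n(n-1)/2 distinct pairs. The short sketch below evaluates this for the attribute counts used in the idealised session above (6, 20 and 400); the function name is just for illustration.

```python
from math import comb


def scatter_grid_panels(n_attributes):
    """Number of distinct attribute pairs, i.e. panels in one triangle of the grid."""
    return comb(n_attributes, 2)


for n in (6, 20, 400):
    print(f"{n} attributes -> {scatter_grid_panels(n)} pairwise plots")
```

The quadratic growth is why some prior selection of attributes is needed before a scatter plot grid becomes readable.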

In the case of parallel co-ordinates, a single representation of the dataset is constructed, with a set of parallel axes, one for each attribute. Each record appears as a ‘polyline’, connecting the corresponding attribute values on each axis. Clusters appear as a set of nearly identical polylines, and correlations between attributes can be observed (at least if their axes are adjacent). A difficulty is the density of the representation as the number of records increases.
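
The construction of the polylines can be sketched as below. The sketch assumes that each attribute is rescaled linearly to the range 0 to 1 so that all axes share one height, which is a common choice but not the only one; the record values and attribute names are invented, and no drawing is done.

```python
def to_polylines(records, attributes):
    """Map each record to a polyline of (axis_index, scaled_value) vertices.

    Every attribute is rescaled to [0, 1] so that all axes share one height.
    """
    lows = {a: min(r[a] for r in records) for a in attributes}
    highs = {a: max(r[a] for r in records) for a in attributes}
    polylines = []
    for record in records:
        line = []
        for axis_index, a in enumerate(attributes):
            span = highs[a] - lows[a]
            # A constant attribute has no spread; place it mid-axis.
            scaled = 0.5 if span == 0 else (record[a] - lows[a]) / span
            line.append((axis_index, scaled))
        polylines.append(line)
    return polylines


# Hypothetical records: attribute names and values are invented.
records = [
    {"mag": 18.0, "lum": 1.0, "z": 0.1},
    {"mag": 19.0, "lum": 3.0, "z": 0.1},
    {"mag": 20.0, "lum": 5.0, "z": 0.1},
]
for line in to_polylines(records, ["mag", "lum", "z"]):
    print(line)
```

Changing the order of the `attributes` list changes which axes are adjacent, and hence which correlations are easy to see.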

Recent research has sought to add an initial pre-processing step, in which the number of attributes and records is reduced (by cluster analysis, for example), so that the resulting visualization is less complex. The work of Ward is especially important, and is made publicly available through the xmdvtool software, available for both Windows and UNIX platforms.

The challenges for this project will include the following:

  • The integration of computer and human processing - how much analysis is done by the machine before the human takes over to visualize the data? Indeed the process has to be an iterative one, whereby the visualization is the prompt to suggest further machine analysis (in this context it is very similar to computational steering of large simulations in CFD, for example).
  • The navigation through high dimensional spaces - the two techniques described above aim to present all the selected data in one visualization (albeit a grid of plots in the case of scatter plots). An alternative is to guide the user through a sequence of low-dimensional projections of the data - this guided tour being driven by data analysis so that only ‘interesting’ subsets are visited.
  • Collaborative analysis - modern visualization software such as IRIS Explorer allows geographically separated researchers to work together on a visualization. How can we apply this technology to data analysis in AstroGrid?

KWB 6-May-03

