Final Report on WP-A5: The AstroGrid Pilot Programme
Summary
The aim of the
AstroGrid Pilot Programme was to identify issues arising in database federation that should be addressed in detail in
AstroGrid's technical workpackages.
A set of five pilots was chosen, to address different aspects of the
general database federation problem and to involve the different parts
of
AstroGrid's user community.
Problems with the availability of the necessary datasets seriously
affected the progress of two of the pilots - namely the optical/near-IR
and X-ray pilots - and these were suspended midway through Phase A; the issues that these two pilots were intended to address remain important, and, since their data access problems appear to have been resolved, work on them will resume and at least some of their planned tasks will
be completed before the start of Phase B. The remaining three pilots -
the radio, solar and STP pilots - proceeded through the full Phase A period, and all delivered working software that was used by test users.
Many useful lessons have been learnt from the
AstroGrid Pilot Programme. At the most general level, test users responded positively to the additional functionality they were offered, but quickly wanted the ability to do more. This reassuringly confirms that the VO enterprise is worthwhile, but also that it will be difficult to meet expectations, and that considerable flexibility will have to be designed into VO systems to help them meet the range of user requirements. A number of more specific results emerged, too, for
example that the VOTable prescription for presenting tabular data in XML may well be useful well beyond its originally intended domain, if appropriate UCDs (e.g. for interferometric, solar physics and STP data) can be defined. It was also clear that the VO must provide the means of estimating the resource implications (e.g. result dataset volume, time taken to run, etc) of proposed operations, lest users grind the system to a halt and/or generate much more data than they
can handle.
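As an illustration of the VOTable point above, the following is a minimal sketch of tabular data presented in VOTable-style XML, built with Python's standard library. The element names (VOTABLE, RESOURCE, TABLE, FIELD, DATA, TABLEDATA, TR, TD) follow the VOTable layout, and the two positional UCDs are existing ones; the table name, the flux column and its UCD are purely hypothetical placeholders for the kind of domain-specific UCD that would have to be defined. The output is schematic and has not been validated against the VOTable schema.

```python
import xml.etree.ElementTree as ET


def make_votable(fields, rows):
    """Build a minimal VOTable-style element tree.

    fields: list of dicts of FIELD attributes (name, datatype, unit, ucd).
    rows:   list of tuples, one value per field.
    """
    votable = ET.Element("VOTABLE")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name="example")  # hypothetical name
    for attrs in fields:
        ET.SubElement(table, "FIELD", **attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return votable


fields = [
    {"name": "RA", "datatype": "double", "unit": "deg", "ucd": "POS_EQ_RA_MAIN"},
    {"name": "Dec", "datatype": "double", "unit": "deg", "ucd": "POS_EQ_DEC_MAIN"},
    # Hypothetical UCD: stands in for one that would need to be defined.
    {"name": "Flux", "datatype": "float", "unit": "mJy", "ucd": "EXAMPLE_FLUX"},
]
rows = [(189.2, 62.2, 1.5)]

doc = make_votable(fields, rows)
xml_text = ET.tostring(doc, encoding="unicode")
```

The meaning of each column is carried by the ucd attribute rather than by the column name, which is why extending the format to a new domain reduces largely to agreeing on suitable UCDs.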
A questionnaire circulated by the teams running the STP and solar pilots, in conjunction with ESA's SpaceGRID initiative, and addressed to a wide cross-section of the international solar system research community, produced some quite explicit performance requirements, for example: the system should provide feedback on
an action within 30s; simple, online tasks should be completed within an average time of a minute; and complex, offline tasks should be completed within an average time of 24 hours. Interestingly, this
survey also identified some Intellectual Property Rights issues not discussed much within the VO community to date, as some respondents thought that there should be a possibility of keeping workflows and query results private within the VO.
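The performance requirements quoted above can be written down as simple checks, as in the sketch below. The three limits are taken from the survey results; the function name, the input format, and the reading of the feedback limit as applying to every individual action (rather than to an average) are our assumptions.

```python
# Limits quoted from the survey responses, in seconds.
FEEDBACK_LIMIT_S = 30              # feedback on an action
ONLINE_MEAN_LIMIT_S = 60           # simple, online tasks (average)
OFFLINE_MEAN_LIMIT_S = 24 * 3600   # complex, offline tasks (average)


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def check_requirements(feedback_times, online_times, offline_times):
    """Return which of the three requirements a set of measured timings meets.

    Assumption: the feedback limit applies to each action individually,
    whereas the two task limits apply to the average, as stated in the survey.
    """
    return {
        "feedback": all(t <= FEEDBACK_LIMIT_S for t in feedback_times),
        "online": mean(online_times) <= ONLINE_MEAN_LIMIT_S,
        "offline": mean(offline_times) <= OFFLINE_MEAN_LIMIT_S,
    }


# Invented timings, for illustration only.
report = check_requirements(
    feedback_times=[5, 20],
    online_times=[30, 80],
    offline_times=[3600, 100000],
)
```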
Perhaps the most valuable lesson learnt from the
AstroGrid Pilot
Programme is the importance of keeping (at least some) users closely engaged in the VO development process. New technologies, and the vast wealth of astronomical data now available, mean that the possible directions that the VO can take greatly exceed what can possibly be delivered, given the finite funding for the various VO projects, and it is essential that the course of VO development is decided by what users most need to do their science and not what the technology
can deliver most readily.
The main body of this report describes the progress on
WP-A5 up to the end of August 2002, when Phase A was originally intended to finish. An Appendix will present further results, up to the close of the extended Phase A, at the end of December 2002.
Introduction
The
AstroGrid Pilot Programme was designed to complement the Phase A technology assessment workpackages. While the latter were intended to evaluate some of the technologies likely to be deployed to help meet
AstroGrid's science requirements, the Pilot Programme
was designed to help elucidate those requirements in more detail, by developing prototype systems to deliver small
portions of
AstroGrid's desired functionality. These prototype
systems were intended to be constructed using technologies that staff were familiar with; the accent here was on learning from the
experience of producing pilot software to meet aspects of
AstroGrid's
needs, without necessarily doing so in the way that
AstroGrid would ultimately meet them. An important goal from the outset was that all pilots should proceed to the point of delivering software
that could be used by test users to do real science, but, equally, it was stressed that no undue effort should be expended in
producing pretty user interfaces, etc, as the lessons to be learnt
concerned the delivery of new functionality, not its presentation
to users.
It was decided that a set of five pilots should be undertaken, one each for the five broad fields covered by
AstroGrid, namely optical/near-IR astronomy, X-ray astronomy, radio astronomy,
solar physics and solar/terrestrial physics (STP). The reason for this was two-fold. Firstly, each of these areas has particular requirements, and it was considered important that the specific
problems facing all five areas should be addressed from the start, to
ensure that the final
AstroGrid system meets the needs of its full user community. Secondly, it was intended that the test users for the pilots would be selected from the likely "early adopters" of
the final
AstroGrid system, so, by covering all disciplines, it was hoped that the Pilot Programme would start to engage appropriate members of all communities in the work of
AstroGrid.
The Pilot Programme workpackage (
WP-A5) was divided into six subpackages - one for each pilot, and an overall coordination activity - and these are described in turn below. In each case we start by listing the
Objectives,
Inputs,
Outputs and
Tasks with which the subpackage was specified in the
WP-A5 Project Description produced at the start of Phase A, and for the five pilots we reproduce the
Example Science Cases that motivated each. The numbering of the Tasks formed the basis for the reporting of progress during Phase A:
Quarterly Reports and Forecasts were expressed in terms of work completed or planned on these numbered Tasks.
WP-A5.0: Pilot Programme Coordination
Objectives:
- To ensure the satisfactory completion of all five pilots and their active engagement with AstroGrid's technical workpackages.
- To record all salient information regarding the undertaking of the Pilot Programme and ensure that the lessons learnt from it are considered in the planning of AstroGrid's Phase B.
Inputs:
- Reports from the WPMs of the WP-A5.n subpackages, and their project teams.
- Communication with WPMs of WP-A1, WP-A2, WP-A3, WP-A4, and WP-A9.
Outputs:
- Report on Pilot Programme at end of Phase A.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.0.0 Design and requirements
- 5.0.1 Project management tasks
- 5.0.2 Programme monitoring
- 5.0.3 Report on pilot programme
Quite soon after the Pilot Programme started in earnest, it became
clear that it would not be possible to achieve a high level of coordination, either between the different pilots or between them and the remaining workpackages in the
AstroGrid Phase A programme. This conclusion was reached for a variety of reasons. The first of these
was the lack of time and staff effort available - both for
WP-A5.0 itself and within the five pilots - for such activity. In common, no doubt, with the rest of the
AstroGrid Phase A programme,
WP-A5 received a smaller amount of effort, in FTE terms, than it was formally allotted, because the staff undertaking the pilots were devoting significant amounts of time to "general
AstroGrid work". Such work - much reading and writing of documents, attending meetings and courses, etc - was absolutely essential to the development of the
AstroGrid project as a whole (e.g. through the production of
Science Problems and
Use Cases as part of the Science Requirements analysis) and to helping staff climb the steep learning curve associated with a "bleeding edge" project such as this, but it did not contribute directly to the specific tasks of
WP-A5. In the light of this, it was decided to devote the available effort to the progress of the individual pilots, and to assign a lower priority to
their coordination, both with each other and with the rest of the Phase A programme. So, the work of
WP-A5.0 reduced to that of monitoring the progress in each pilot and setting up a Wiki page (
PilotDocs)
on which short reports from the Pilot Programme could be posted.
Allowing each pilot to proceed more independently than originally envisaged also reflected the geographical distribution of the
WP-A5 team: staff working on four of the five pilots were co-located in a single institution per pilot, and in the fifth case (
WP-A5.4), the two institutions involved were relatively close and the individual staff used to working closely together. Also, the five pilots were, by design, domain-specific and working on very different problems, so the scope for adopting a common approach and sharing directly relevant information proved very limited; the one exception being the common user requirements survey undertaken for
WP-A5.4 and
WP-A5.5 in conjunction with ESA's
SpaceGRID initiative.
One disadvantage of allowing the pilots to proceed relatively independently was that there was less interaction between
WP-A5 and the rest of the Phase A programme than was anticipated. It was not found possible for lessons from the pilots to be fed rapidly to the technical workpackages or for their results to influence the course of the pilots. To a large extent, though, this reflects not so much a
lack of communication between
WP-A5 and the other workpackages, but rather the fact that the technology evaluations in the technical workpackages proceeded at a fairly generic level, thereby reducing the possibility of direct interaction with the nitty-gritty of the pilots. It is important that the "implementation" and "R&D" strands within Phase B are coupled more tightly than this, but that should be facilitated by the iterative approach planned, since what "R&D"
work there is will be shorter-lived and more directly related to the delivery of software modules in the next iteration.
While the coordination activity of
WP-A5.0 was looser than perhaps originally anticipated, it was perfectly adequate for monitoring the progress of the Pilot Programme as a whole. As detailed below, it was decided to suspend work on
WP-A5.1 and
WP-A5.2 part way through, when it became clear that their progress was slow and that the effort allocated to them would be more profitably employed elsewhere within the Phase A programme. Summaries of
WP-A5 progress were presented at all consortium meetings, as well as via
Quarterly Reports, and a presentation on the Pilot Programme was made at the ESO/ESA/NASA/NSF Astronomy Conference
"Toward an International Virtual Observatory" held at Garching in June 2002.
In summary, the coordination activity of
WP-A5.0 successfully monitored the progress of the Pilot Programme. In terms of the original
Objectives, it ensured the satisfactory completion of three of the five pilots and re-directed effort from the other two when it was clear that their continuation would be an ineffective use of resources. The engagement of the Pilot Programme with the technical workpackages was weaker than anticipated, as a result of the way both parts of the Phase A programme developed, and it is important that the "R&D" strands within Phase B are kept more tightly coupled to its "implementation" activity. Finally, the salient information regarding the undertaking of the Pilot Programme was recorded, via the identification of deliverables to be posted on a dedicated Wiki page (
PilotDocs), which could also influence the planning of
AstroGrid's Phase B.
WP-A5.1: Optical/Near-IR pilot - Large Object Catalogues
Objectives:
- To produce working federations of two pairs of datasets: the Sloan EDR and SuperCOSMOS Sky Survey, using SX; and INT WFC and CIRSI data using VizieR.
- To compare the suitability of the two approaches for performing the sort of complex, multi-parameter searches which VO users will want to perform on large object catalogues in the future.
Inputs:
- A copy of the Early Data Release (EDR) database of the Sloan Digital Sky Survey (SDSS).
- The SuperCOSMOS Sky Survey (SSS) database.
- Object catalogues from the INT-WFC and INT-CIRSI.
- The VizieR and SX archive systems.
Outputs:
- Large object catalogue federation test bed.
- Feedback on prototype system from test users.
- Inputs of requirements to WP-A1.
- Inputs of implementation issues to WP-A2, WP-A3, WP-A4 and WP-A9.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.1.0 Design and requirements
- 5.1.1 Obtain data needed for pilot federation in desired format
- 5.1.2 Determine what changes must be made to SX for the SSS implementation
- 5.1.3 Implement modifications to SX
- 5.1.4 Provide UI for SSS-SX
- 5.1.5 Evaluation of test bed implementations
Example Science Cases:
A scientist wishes to search for halo white dwarf stars, which requires selection criteria making use of both colour and proper motion information for a large sample of stars (for these are rare objects). This can be achieved by querying the multi-epoch/multi-colour dataset produced by federating
the Sloan EDR dataset with the SuperCOSMOS Sky Survey coverage of the same region.
A scientist wants to determine the optical and infrared colours
of an object, e.g. an X-ray (XMM), radio (FIRST) or far-infrared (ISO) source. This can be achieved by querying the multi-colour dataset produced by federating the INT WFC five-colour optical dataset with the two-colour INT CIRSI dataset of the same region.
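To make the first science case concrete, the sketch below shows the shape of a query against a federated colour and proper-motion dataset, using an in-memory SQLite database. Everything in it is an assumption made for illustration: the table and column names, the idea that the two catalogues have already been cross-identified by a shared obj_id, the invented data values, and the selection thresholds, which are not real white dwarf selection criteria.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, already cross-identified catalogues.
CREATE TABLE sdss (obj_id INTEGER PRIMARY KEY, g REAL, r REAL);
CREATE TABLE sss  (obj_id INTEGER PRIMARY KEY, pm_arcsec_yr REAL);
INSERT INTO sdss VALUES (1, 18.9, 18.8), (2, 19.5, 18.2), (3, 20.1, 20.0);
INSERT INTO sss  VALUES (1, 0.45), (2, 0.50), (3, 0.01);
""")


def candidates(conn, max_colour, min_pm):
    """Select objects that are both blue enough and moving fast enough.

    Colour comes from one survey and proper motion from the other, so
    neither catalogue could answer the query on its own.
    """
    sql = """
        SELECT d.obj_id, d.g - d.r AS colour, s.pm_arcsec_yr
        FROM sdss AS d
        JOIN sss AS s ON s.obj_id = d.obj_id
        WHERE d.g - d.r < ? AND s.pm_arcsec_yr > ?
        ORDER BY d.obj_id
    """
    return conn.execute(sql, (max_colour, min_pm)).fetchall()


# Illustrative thresholds only.
result = candidates(conn, max_colour=0.5, min_pm=0.1)
```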
The optical/near-IR object catalogues to be included in the
AstroGrid
system greatly exceed all its other databases in size, so there was a clear need for a pilot that addressed the practical problems of federating large object catalogues. Initially, it was intended to study this using two methods. Firstly, data from the SuperCOSMOS Sky Survey (
SSS) covering
the fields of the Sloan Digital Sky Survey (
SDSS) Early Data Release (EDR) would be federated with the EDR data themselves, using a version of the SDSS science archive software (called
SX), modified for use with the
SSS data model. Secondly, catalogues of objects derived from INT imaging with the
Wide Field Camera (optical) and
CIRSI (near-infrared) would be federated by making them both accessible
via the
VizieR system developed by the Centre de Données astronomiques
de Strasbourg (
CDS). The functionality provided by the two approaches could then be compared by providing access to both federations to test users, who could assess how well they each match the needs of scientists using future federations of large object catalogues in the VO. In the event, it was
decided to evaluate the use of
VizieR in this regard only within the scope of
AVO Work Area 2 (Interoperability), not as part of
WP-A5.1: the
results of this work will be reported elsewhere and we shall only consider the
SSS-
SDSS federation in what follows.
SX is a database system developed by Alex Szalay's group at Johns Hopkins
University for the Sloan Digital Sky Survey (
SDSS) science archive. It is
built upon
Objectivity/DB, a commercial object-oriented database management system (DBMS). At the time that the Phase A programme was being designed, the
SX/
Objectivity system was unquestionably the leading solution in use within astronomy for handling large object catalogues, so it was the natural choice on which to base the optical/near-IR pilot. The Wide Field Astronomy Unit (
WFAU) in Edinburgh were to create a mirror of the
SX-based archive of the
SDSS-EDR, so it made sense to centre the pilot on the federation of
Sloan and
SuperCOSMOS data in the EDR region.
While
WFAU staff worked to install the EDR mirror in Edinburgh, the effort in
WP-A5.1 was directed towards Tasks 5.1.2 - 5.1.4, namely the production of an
SX-like database system to hold
SuperCOSMOS Sky Survey catalogue data. The first step in this process was to determine what changes would have to be made to
SX for the
SSS implementation. Despite
having the potential for much more general applicability, the
SX system
had been designed specifically for the
SDSS, rather than as a general astronomical database system, so that code specific to the
SDSS schema was hardwired
into
SX in many places. So, Task 5.1.2 reduced to the identification
of all schema-dependent code within the
SX system, as well as the design of a new schema for the
SX implementation of the
SSS to use in those places, and both these steps had to be complete before work on Task 5.1.3 (Implement modifications to
SX) could begin in earnest.
An
SX-like schema was designed for the SuperCOSMOS Sky Survey (
SSS). This copied the
SX design (detailed at
http://archive.stsci.edu/sdss/software/sdssQT_guide/introclasses.html) of having compact "tag" objects which contain (what is felt to be) the most popular subset of the attributes of the
SSS objects, i.e. those most likely
to be used in queries frequently. It also greatly expanded the range of
SSS data that would be available to users, by including a great deal of "housekeeping" metadata not made accessible via the
current WWW-based system, which links to flat files arranged solely by position, and which, therefore, supports positional queries
only.
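The "tag" object idea described above can be sketched as follows. This is not the SX schema or the schema designed for the SSS: all attribute names are hypothetical, and the sketch only illustrates the principle that queries scan a compact object holding the most frequently used attributes, and fetch the full object (including housekeeping metadata) only for those that match.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SSSObject:
    """Full catalogue entry (attribute names are hypothetical)."""
    obj_id: int
    ra: float
    dec: float
    mag_b: float
    mag_r: float
    pm_ra: float
    pm_dec: float
    plate_id: int        # example of "housekeeping" metadata
    quality_flags: int   # example of "housekeeping" metadata


@dataclass(frozen=True)
class SSSTag:
    """Compact subset of the attributes most often used in queries."""
    obj_id: int
    ra: float
    dec: float
    mag_b: float
    mag_r: float


def make_tag(obj):
    """Copy the tag attributes out of a full object."""
    return SSSTag(**{f.name: getattr(obj, f.name) for f in fields(SSSTag)})


def query(tags, full_by_id, predicate):
    """Scan the compact tags; fetch full objects only for the matches."""
    return [full_by_id[t.obj_id] for t in tags if predicate(t)]


# Invented values, for illustration only.
objects = [
    SSSObject(1, 10.0, -5.0, 18.0, 16.5, 0.0, 0.0, 101, 0),
    SSSObject(2, 11.0, -5.5, 17.0, 16.8, 0.0, 0.0, 101, 0),
]
tags = [make_tag(o) for o in objects]
full_by_id = {o.obj_id: o for o in objects}
red_objects = query(tags, full_by_id, lambda t: t.mag_b - t.mag_r > 1.0)
```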
The identification of schema-dependent code was started, with help from some of those at Johns Hopkins (notably Ani Thakar) who had written
SX.
We shall not relate the details here, but most of the schema-dependent code was identified as residing in the following modules: sxSchema (which sets up the objects in the schema); sxLoader (which loads data in the database); sxAbstract (which provides a run-time schema against which queries are checked); and sxDatabase (which includes the parsing of queries by
SX's own query parser and their translation into
Objectivity/DB's built-in query primitives).
While the work of Task 5.1.2 was still underway, the
SDSS consortium announced that they were not going to use
Objectivity/DB beyond the
EDR; subsequent data releases were to be made using a system based on
SkyServer, which was developed by the Johns Hopkins group in conjunction with Jim Gray of Microsoft Research, and which is based on
SQL Server, Microsoft's relational DBMS. Clearly, this decision reduced
AstroGrid's level of interest in the
SX system, since it was no longer a direct prototype for future VO systems. At the same time, it was noted that the slow progress with the installation of an EDR mirror at Edinburgh was pushing the remaining work on
WP-A5.1 later and later into Phase A, so it was decided that work on the optical/near-IR pilot be suspended, as the effort allocated to it could be more usefully employed elsewhere within the Phase A programme.
In the last couple of months,
WFAU staff have started collaborating with Jim Gray to produce a new
SSS database using
SQL Server. The design of the schema for this new database made use of much of the thought and design work that went into producing the
SX-like schema in Task 5.1.2, so that work was not wasted, despite the suspension of
WP-A5.1. With the extension of Phase A to run to the end of 2002, it is likely that much of the
SSS-
SDSS
federation work originally intended to be undertaken using the
SX/
Objectivity system can now be completed using this new system based on
SQL Server.
WP-A5.1 is therefore resurrected and is now making good progress.
In summary, the original
Objectives of
WP-A5.1 have not yet been met. Use of the
VizieR system for the federation of large object catalogues is to be considered solely within the context of
AVO Work Area
2 (Interoperability) and results will be reported elsewhere. The federation of
Sloan and
SuperCOSMOS data in the
Sloan EDR region using
the
Objectivity-based
SX system was terminated, due to the slow progress with getting an EDR mirror installed at Edinburgh and the decision by the
SDSS consortium to discontinue use of
Objectivity/DB. A new
SSS
database is now being created, using
SQL Server, and it is hoped that
much of the work originally planned for
WP-A5.1 can now be completed by the end of December 2002, using this new system, instead of
SX.
WP-A5.2: X-ray pilot - Association techniques
Objectives:
- To produce a federation of optical and X-ray datasets in a significant number (several hundred) XMM-Newton fields, which can be used for scientific analysis.
- To assess the requirements for performing, within the VO, associations between catalogues of objects detected in different wavebands.
Inputs:
- A prototype version of the first XMM-Newton source catalogue to be released by the SSC.
- Object catalogues derived from optical data in the same fields as the XMM-Newton observations included in the X-ray catalogue.
Outputs:
- Associations test bed.
- Feedback on prototype system from test users.
- Inputs of requirements to WP-A1.
- Inputs of implementation issues to WP-A2, WP-A3, WP-A4 and WP-A9.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.2.0 Design and requirements
- 5.2.1 Prepare optical data for fields covered by test bed catalogue
- 5.2.2 Interface to prototype SSC catalogue
- 5.2.3 Association methods
- 5.2.4 Evaluate testbed implementation
Example Science Case:
A scientist wishes to determine the variation of some X-ray hardness ratio as a function of X-ray flux and optical/near-infrared colour, to constrain models for the properties of the population of obscured AGN. To do this requires the association of objects in X-ray and optical/near-infrared catalogues, followed by the selection of subsamples of associated sources on the basis of X-ray properties.
This pilot is the only one of the five whose specification changed markedly between the submission of the original
AstroGrid proposal in April 2001 and the start of Phase A in September 2001. In the original proposal, the X-ray pilot was intended to federate data products from
XMM-Newton and
Chandra. The most basic type of data in X-ray astronomy is the event list, which records the temporal and spatial location of every incident X-ray detected by the instrument, together with a measure of its energy. The information contained in the event list can be viewed in a variety of ways - to produce an image, a spectrum, a lightcurve - and the aim of the original X-ray pilot was to develop prototype tools to help the user shift readily between these different representations, at least for
XMM-Newton and
Chandra data. During the discussions defining the Phase A programme it was decided that this work would not provide much more than is already available within existing X-ray data analysis packages, and that the development of any further functionality was more properly the job of the groups that build such software packages, rather than that of
AstroGrid.
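For readers unfamiliar with event lists, the sketch below shows how the three representations mentioned above are simply different binnings of the same list. The tuple layout, the units and the data values are invented for illustration and do not correspond to any XMM-Newton or Chandra file format.

```python
from collections import Counter

# Each event: (time in s, x pixel, y pixel, energy in keV). Invented values.
events = [
    (0.5, 10, 12, 1.2),
    (1.7, 10, 12, 2.8),
    (2.1, 11, 12, 0.6),
    (7.9, 10, 13, 5.1),
]


def image(events):
    """Counts per detector pixel: bin on position, ignore time and energy."""
    return Counter((x, y) for _, x, y, _ in events)


def lightcurve(events, bin_s):
    """Counts per time bin: bin on arrival time only."""
    return Counter(int(t // bin_s) for t, _, _, _ in events)


def spectrum(events, bin_kev):
    """Counts per energy bin: bin on energy only."""
    return Counter(int(e // bin_kev) for _, _, _, e in events)


img = image(events)
lc = lightcurve(events, bin_s=5.0)
spec = spectrum(events, bin_kev=1.0)
```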
Instead, it seemed better to undertake a pilot to assess association
techniques, based upon the finding of optical counterparts of
XMM-Newton sources in data from the
INT and
SuperCOSMOS. The association of entries in different databases identified as being observations of the same astronomical object lies at the heart of the Virtual Observatory (VO) concept, but it appears to have received little attention from the VO community to date. Astronomical data has a natural indexing - spatial location in the sky - which aids the making of such identifications, but, in many cases, association by spatial proximity alone is not adequate. This is most clearly the case in situations, such as the determination of optical counterparts for infrared sources from
ISO or submillimetre sources from
SCUBA, where the angular resolution of one catalogue is so poor, and the surface density of objects in the other so high, that there can be many objects from one catalogue located within the positional error ellipse of each source from the other. In this case, the easy identification of the true optical counterpart is not possible, and a probabilistic method must be used, to assess which of the candidate counterparts is the most likely to be the true match. Astronomers already use several such probabilistic prescriptions, but they are frequently used in situations in which each possible association can be checked manually for plausibility. The VO provides
a much greater challenge. Not only does the size of the datasets to be
made available by the VO mean that associations could be sought between databases containing millions of objects each, raising concerns about the scalability of the association techniques commonly used, but there are also issues relating to how the method employed to make a set of associations can be recorded so that a later user can judge whether s/he can employ them with confidence, rather than recomputing them all anew.
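The following sketch illustrates, under strong simplifying assumptions, the two ideas discussed above: ranking several candidate counterparts probabilistically rather than by proximity alone, and recording the method alongside its results. The likelihood ratio used here is deliberately simplified (a circular Gaussian positional error divided by a uniform background surface density, with no use of magnitudes) and is not the prescription adopted by the pilot; all names, parameter values and the layout of the stored record are hypothetical.

```python
import math


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcsec (haversine formula; inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def rank_candidates(source, candidates, sigma_arcsec, density_per_arcsec2,
                    max_sigma=3.0):
    """Rank candidates inside the error circle by a simple likelihood ratio.

    LR = (Gaussian positional probability density) / (background density).
    Returns (LR, id, separation) tuples, most likely counterpart first.
    """
    ranked = []
    for cand in candidates:
        r = separation_arcsec(source["ra"], source["dec"],
                              cand["ra"], cand["dec"])
        if r > max_sigma * sigma_arcsec:
            continue  # outside the search radius
        f = (math.exp(-r ** 2 / (2 * sigma_arcsec ** 2))
             / (2 * math.pi * sigma_arcsec ** 2))
        ranked.append((f / density_per_arcsec2, cand["id"], r))
    return sorted(ranked, reverse=True)


# Invented positions: candidates 1, 4 and 30 arcsec from the source.
source = {"ra": 10.0, "dec": 20.0}
candidates = [
    {"id": "a", "ra": 10.0, "dec": 20.0 + 1 / 3600},
    {"id": "b", "ra": 10.0, "dec": 20.0 + 4 / 3600},
    {"id": "c", "ra": 10.0, "dec": 20.0 + 30 / 3600},
]
params = {"sigma_arcsec": 5.0, "density_per_arcsec2": 1e-3, "max_sigma": 3.0}

# Store the method and its parameters with the matches, so that a later
# user can judge whether the associations can be reused.
record = {
    "method": "gaussian-likelihood-ratio (illustrative)",
    "parameters": params,
    "matches": rank_candidates(source, candidates, **params),
}
```

Note that this loops over every candidate for every source; for catalogues of millions of objects, a spatial index would be needed to restrict the candidates first, which is the scalability concern raised above.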
These issues are discussed in more detail on the Wiki page
WPA5AssociationMethods produced as part of Task
5.2.3, and it should be noted that the X-ray/optical association planned for this pilot has one specific complication not discussed there, namely the possibility of real one-to-many associations. Amongst the most prevalent objects in the X-ray sky are galaxy clusters. In the X-ray, these appear as regions of extended emission (with a particular spectral signature), but in the optical they are resolved into their constituent galaxies. Within the galaxy cluster survey community, there exist methods for dealing with this complication - for example, smoothing optical data to the angular scale of clusters, or searching for clusters in optical data using matched filter techniques and then seeking matches for them in X-ray data - but in most cases, these depend on there being a candidate cluster (i.e. an extended X-ray source) under study. It would take some thought to integrate such methods into a more general procedure for finding the optical counterparts of X-ray sources, and it is possible that the association procedures for point-like and extended X-ray sources would have to be pretty independent, as a result. (
N.B. There are several other scenarios in which one-to-many associations can arise in astronomy, often because the emission seen at the different wavelengths originates in spatially distinct regions.
For example, a (relatively) low resolution IRAS source may be associated with star formation in a tidal tail linking an interacting pair of galaxies seen as two distinct objects in the optical, or a double radio source may be located to both sides of what looks like a normal, isolated galaxy in the optical. This can be particularly problematic when the source is well resolved - e.g. in high-resolution radio mapping - in which case it can be complicated to include a description of the source's structure in the association procedure.)
The revised plan for
WP-A5.2 was to start addressing these issues by seeking optical counterparts for sources in the first
XMM-Newton source catalogue to be released by the XMM-Newton Survey Science Centre (
SSC) based in Leicester. At the time this revised pilot was specified, it was anticipated that this first
XMM-Newton source catalogue would contain in excess of 50,000 sources, drawn from
XMM-Newton observations of more
than 700 fields. In the event, the X-ray pilot of
WP-A5.2, like the optical/near-IR pilot of
WP-A5.1, suffered seriously due to significant delays in the delivery of its input datasets. In this case, this was a delay with the production of the first
XMM-Newton source catalogue by the XMM-Newton Survey Science Centre (
SSC), which pushed the bulk of the work on
WP-A5.2 later and later in Phase A. In view of that, it was decided that work on
WP-A5.2 should be suspended and the effort allocated to it moved to other parts of the Phase A programme, where it could be used more effectively.
Only a fraction of the work of
WP-A5.2 had been completed at that point. In terms of the original
Task list above, this was
work on Tasks 5.2.0, 5.2.1 and 5.2.3. A discussion of the design and requirements, expanding on that outlined above, is presented in the Wiki document
XrayPilotDesign, while, as noted above, the Wiki document
WPA5AssociationMethods was produced as part of Task 5.2.3. Some of the other work from that Task (Association methods)
and also Task 5.2.1 (Prepare optical data for fields covered by test bed catalogue) is described in Wiki documents produced in conjunction with
WP-A4 work on database systems: notably Clive Page's document
"Indexing the Sky" includes discussion of the purely spatial matching side to associations, while the requirements of
WP-A5.2 were borne in mind when selecting the criteria used in the
"DBMS Evaluations". In addition, an optical/X-ray cross-matching between a ROSAT catalogue and the USNO-A2 catalogue was performed successfully, but the source density there was insufficient to tax the association method too much.
In summary, the X-ray/optical federation planned for
WP-A5.2 has not yet been completed. Work on this pilot was suspended, mid-way through Phase A, due to delays with the delivery of the first
XMM-Newton source catalogue by the XMM-Newton Survey Science Centre (
SSC). Much of the preparatory work for this pilot has been completed, however, either as part of
WP-A5.2 itself or in conjunction with the database work of
WP-A4. Now that the
XMM-Newton source catalogue is available, it will be possible to resurrect this pilot, and it is hoped that some fraction of what remains of its originally-planned work will be completed by the time the extended Phase A closes, at the end of December 2002.
WP-A5.3: Radio pilot - Fourier data
Objectives:
- To develop protocols and methods for remote access to radio interferometer data.
- To prototype parallelization of imaging software for Beowulf cluster (in conjunction with AVO WA3.3).
- To demonstrate a simple interface to provide remote access to selected subsets of these data.
Inputs:
Outputs:
- Fourier data federation test bed.
- Feedback on prototype system from test users.
- Inputs of radio requirements to WP-A1.
- Inputs of implementation issues to WP-A2, WP-A3, WP-A4 and WP-A9.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.3.0 Design and requirements
- 5.3.1 Evaluate aips++ for remote/distributed access to visibility data
- 5.3.2 Develop prototype environment for access to visualisation data
- 5.3.3 Develop metadata standards for interferometric data
- 5.3.4 Select and prepare radio data for use in pilot
- 5.3.5 Trial of remote processing on JBO COBRA (aips++ pimager)
- 5.3.6 Investigate tools for multi-wavelength comparisons
- 5.3.7 Evaluate test bed implementation
Example Science Case:
The incidence of AGN in star-forming galaxies is an important test of theories of galaxy evolution. An astronomer addresses this issue by taking X-ray (e.g. Chandra) and optical/near-IR (e.g. CFHT or Subaru)
catalogues, selecting a sample of candidate AGN and generating a radio
image around the position of each from archival visibility data. The radio structure on various scales (including any evidence of mergers) and the radio spectral index can then be used to reveal starburst regions and obscured AGN.
High resolution radio interferometer arrays produce data sets which
are samples of the Fourier transform of the radio sky. These 'visibility data' can be processed in different ways depending on the astronomical requirements. Since sampling in the Fourier plane can be sparse, non-linear deconvolution is a necessary and critical step in the production of images which can be easily interpreted. Although the field-of-view of high resolution interferometer arrays is often large, the information content can be significantly smaller, due to the limited Fourier sampling. It is therefore more efficient and more productive to maintain access to the data in the Fourier plane and produce images or data products on demand, with options as to whether to carry out a particular deconvolution, fit models to the data in the Fourier plane, combine with data from another interferometer, or to select a particular region on the sky. This pilot, taken together with work from
AVO WA3.3 (
Scalable computing and storage), was designed to produce a test-bed system that allows users to
access visibility data remotely and then launch image production on-the-fly, tailored to the requirements of the specific science goal, and implemented in a parallel computing environment to enhance speed. In a further stage, it was intended that it should include associations with data from other catalogues, using sophisticated criteria to allow for source structure and different resolution scales.
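The idea of keeping the data in the Fourier plane and imaging only the region requested can be illustrated with a deliberately simplified sketch. It is one-dimensional, uses a single point source and a handful of made-up sample spacings, and performs only the direct inverse transform (a "dirty" image, with no deconvolution). It is not the AIPS or aips++ procedure used in the pilot; all names and numbers are assumptions for illustration.

```python
import cmath
import math

def visibilities(sources, u_samples):
    """V(u) = sum_j S_j * exp(-2*pi*i*u*x_j) for point sources given as (x_j, S_j)."""
    return [sum(s * cmath.exp(-2j * math.pi * u * x) for x, s in sources)
            for u in u_samples]

def dirty_image(vis, u_samples, x_grid):
    """Direct inverse transform of the sampled visibilities (no deconvolution)."""
    n = len(u_samples)
    return [sum((v * cmath.exp(2j * math.pi * u * x)).real
                for v, u in zip(vis, u_samples)) / n
            for x in x_grid]

# Hypothetical toy data: one source of unit flux at x = 0.30,
# observed at a sparse, irregular set of Fourier-plane samples.
sources = [(0.30, 1.0)]
u_samples = [1.0, 2.0, 3.5, 7.0, 11.0, 16.5]
vis = visibilities(sources, u_samples)

# Image only a small cut-out around the position of interest.
x_grid = [0.20 + 0.01 * i for i in range(21)]
image = dirty_image(vis, u_samples, x_grid)
peak_x = x_grid[image.index(max(image))]
```

The stored visibilities are the same whichever cut-out is requested; only `x_grid` changes, which is why a server holding the Fourier-plane data can produce images of any part of the field on demand.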
It was decided to undertake this pilot using data from
a region of sky which has been extensively observed right across the electromagnetic spectrum, namely the
HDF(N). The unprecedented sensitivity of the observations required and produced very large data sets, from which there is still significant scientific information to be extracted. This makes it an ideal candidate for the
AstroGrid pilot, and some aspects are dealt with in the
Science Problem "DeepFieldSurveys". Source lists are available for many observations and fully calibrated data is held at
Jodrell Bank from the work of Muxlow et al. (
MERLIN), Richards et al. (
VLA), Williams et al. (
HST), and Barger et al. (
CFHT): at a later date, it may be possible to include additional public data, from Garrett et al. (
EVN and
WSRT), Aussel et al. (
ISO), Hornschmeier et al. (
Chandra) and Hughes et al. (
SCUBA).
AIPS++ was installed on the
COBRA cluster, but there has been only limited success in testing 'pimager', a parallel implementation of the main
aips++ imaging and deconvolution task. It transpires that pimager has yet to be parallelised effectively for the production of cut-out images from large visibility data sets.
WP-A5.3 staff at
Jodrell Bank are in contact with the
AIPS++ developers and hope to contribute to the future production of suitable tasks. They are also following developments at smaller arrays which are able to use the present release more extensively:
NRAO are building an on-line archive of raw
VLA data, and extracting the metadata as
aips++ tables;
ATCA are developing an archiving pipeline with a view to integration into the Australian VO; and
BIMA are also investigating the use of
AIPS++ with
Globus.
Given the partial success of this
aips++ evaluation, it was decided to proceed with Task 5.3.2 (
Develop prototype environment for access to visibility data), using
"classic" AIPS inside a cgi wrapper. By this route, the
MERLIN Archive now supports simple queries (returning text and ready-made plots and
FITS images) either via its web page (
www.merlin.ac.uk/archive) or via
CDS. A prototype interface for on-the-fly imaging of visibility data was produced and has been tested locally and remotely. This enables users to extract maps from calibrated visibility data at any position within the field of view. Typically, only the central few arcsec of archive data have ever been imaged. The user has only to specify the size, position and resolution required to obtain an image, in this case showing the field of a supernova which went off a few years after the centre of NGC 7469 was observed in HI.
The plots below show the results of using this prototype to extract maps from the archival visibility data for NGC 7469.
Fig 1.
This shows 5GHz data from 2001. The left hand panel shows the Seyfert nucleus of NGC 7469, and the right hand panel shows a field out in one of the galaxy's spiral arms where a radio supernova was detected serendipitously.
Fig 2.
This shows archival 1.6GHz data from 1993. The left hand panel shows
the Seyfert nucleus of NGC 7469, while the right hand panel shows
a 5GHz map of the same field as the right hand panel of Fig. 1,
exhibiting the lack of a radio supernova.
Astronomers using this facility do not need any knowledge of
AIPS or of radio data, but behind the scenes the archive uses two alternative routes depending on the size and complexity of the input data sets.
MERLIN-only visibility data around a chosen region can be imaged on demand. A small amount of further development is needed to provide informative feedback if the user requests images at a position or resolution which the data cannot provide, based on individual data set properties. For very large data sets taken from more than one array (e.g. the
HDF MERLIN+
VLA data) two-stage data processing is required and only one route is currently implemented (combining images in the map plane); other possibilities will be investigated, in due course.
The figures below show a "radio cutout server" for the
HDF dataset.
Fig 3. An astronomer has interrogated the
MERLIN archive and learnt about the Muxlow et al. observations in the
Hubble Deep Field. The user then enters information about the field ("Offset field 1")
for which s/he wants to generate an image from the visibility data.
Fig 4.
The
AIPS script then generates the image specified by the user on-the-fly and displays a postage stamp showing the resulting image (left-hand image). The data may then be downloaded in
FITS format, and the
right-hand image shows the result of overplotting the contours of this freshly-generated radio image on an optical image.
In the case of the
HDF data, only about 1/7 of the total field of view has ever been imaged, but the sensitivity means that more science can undoubtedly be done (e.g. the background radio flux or barely detectable sources at the position of
Chandra sources). The calibrated visibility data are several GB, making it impractical to supply them to off-site astronomers even if they could use
AIPS locally. Thus a Virtual Observatory service is not merely a convenience but a necessity for such data sets.
Work on deriving interferometry metadata standards has proceeded in conjunction with
AVO Interoperability work. A library of terms to describe radio interferometry observations has been drawn up and these have been translated into
CDS UCDs
and work is ongoing to expand these where they are not at present sufficient: a fuller report on this activity can be found at
http://www.jb.man.ac.uk/~amsr/WP5.3/radiometadata.html. Consultations are underway with experts on higher frequency Fourier data to expand the scope of these metadata to ensure that they cover the requirements of interferometry in general, and not just those of radio astronomy.
Under the heading of Task 5.3.6, the investigation of tools for multi-wavelength comparisons, evaluations have been made of existing tools for astrometric alignment: the
MERLIN,
VLA,
WSRT,
EVN,
CFHT and
HST datasets from the
HDF(N) can now be compared using
AIPS and it is possible to produce results in a variety of formats. Additionally, the requirements for the correct treatment of radio interferometry image data in
Aladin have been addressed, in conjunction with
CDS, and contributions have been made to
Starlink projects developing visualisation and astrometry tools. For the
AVO Science Demo, it will be necessary to investigate whether SExtractor, the proposed primary tool for flux density measurements, can be used on radio interferometry images, or whether it is possible to wrap
AIPS into the demo interface. Initial work on this confirms that the FITS images generated from archived visibility data can be manipulated (not merely displayed)
using
Aladin - e.g. to construct colour maps combining radio and optical images - and (with minor flux scale editing) SExtractor can be run on them, to identify sources and measure their parameters.
The prototypes for remote imaging have so far only been tested on-site, or off-site by the developer. They will be opened up for public use soon, which will generate more user feedback, while the lessons
learnt are being shared with other radio observatories (e.g.
ATCA,
WSRT and the
EVN) to aid the development of their archives, and will
influence the design of the forthcoming
ALMA archive.
User feedback from the remote imaging facility:
Tom Muxlow tested the web interface for extraction of images from the
MERLIN+
VLA HDF(N) datasets. He confirmed that the astrometric accuracy is retained, and the image quality is reasonable. However, he commented that it takes a long time to return images, and that the user could request maps which are impossible to make.
At present the pilot software is using the method described in "High Resolution Images of Radio Sources in the Hubble Deep and Flanking Fields" (Muxlow et al, MNRAS, 2002). This involves retaining separate
MERLIN and
VLA
multi-channel data (to allow any part of the field of view to be imaged) and the Fourier Transform process is very lengthy. Separate dirty maps are produced, combined and cleaned to produce the final image. The alternative, which is under development now, involves breaking the visibility data up into sub-fields which can be averaged in frequency and time and both data sets can be combined. Images of any region can then be produced directly from the (much less bulky) visibility data set for the appropriate sub-field.
This should speed up the process so that it is comparable with the few minutes taken for on-the-fly imaging of other archive data sets (
MERLIN-only, a few days of data at most). As suggested in the user feedback, restrictions and defaults will be added to the UI, and an attempt made to explain simply to the user what is possible, so that the inexperienced user is more likely to specify a reasonable job. Another enhancement will be the provision of access to the FITS image, as well as just a plot.
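A minimal sketch of the kind of restriction described above is given here. It assumes that each data set carries three properties (usable field-of-view radius, finest supportable resolution and a server-imposed maximum image size); these property names, and the limits used in the example, are hypothetical and are not taken from the MERLIN archive.

```python
from dataclasses import dataclass

@dataclass
class DatasetLimits:
    """Hypothetical per-dataset properties the archive would need to hold."""
    fov_radius_arcsec: float      # usable field of view around the pointing centre
    min_resolution_arcsec: float  # finest resolution the data can support
    max_image_arcsec: float       # largest cut-out the server is willing to make

def check_request(limits, offset_x, offset_y, size, resolution):
    """Return a list of readable problems; an empty list means the job is acceptable."""
    problems = []
    # Distance of the furthest corner of the square cut-out from the pointing centre.
    half = size / 2.0
    reach = ((abs(offset_x) + half) ** 2 + (abs(offset_y) + half) ** 2) ** 0.5
    if reach > limits.fov_radius_arcsec:
        problems.append("requested region extends beyond the field of view")
    if resolution < limits.min_resolution_arcsec:
        problems.append("requested resolution is finer than the data can provide")
    if size > limits.max_image_arcsec:
        problems.append("requested image is larger than the server limit")
    return problems

# Illustrative limits only (all values in arcsec).
limits = DatasetLimits(fov_radius_arcsec=60.0,
                       min_resolution_arcsec=0.05,
                       max_image_arcsec=30.0)
ok = check_request(limits, offset_x=10.0, offset_y=10.0, size=5.0, resolution=0.2)
bad = check_request(limits, offset_x=58.0, offset_y=0.0, size=10.0, resolution=0.01)
```

Returning messages rather than a bare yes/no lets the interface explain to an inexperienced user why a job was refused before any imaging is started.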
Several other avenues are being explored to speed up access to the archive pages. A change of image format from JPEGs to PNGs will improve performance, as the latter are typically significantly smaller files. One of the current performance bottlenecks arises because, when an archive user clicks on a selected observation of a source, the archive server searches directories to see what products are available for that source. This is to allow access to improved or additional products which may have been added since the source was originally archived. All the archive pages are created in response to a request, to allow up-to-date resource discovery. However, this may be slowing down the response. The other possible time-consuming procedure is the need to consult several databases containing different sorts of information (about the PI, about the target etc.). As part of the radio pilot, these and other performance bottlenecks in the
MERLIN archive system are being identified.
To summarise, two of the three original
Objectives of this pilot have been met. Protocols and methods for remote access to radio interferometer data have been developed, and a simple interface has been produced to provide remote access to selected subsets of these data. The third has not been met yet, as problems with the
aips++ pimager code mean that it has not been possible to prototype the parallelisation of imaging software for a Beowulf cluster.
WP-A5.4: Solar pilot - Data selection from summary information
Objectives:
- To define requirements for easy interrogation of summary information in solar physics.
- To deploy a test bed implementing these requirements to enable selection from a restricted set of datasets
Inputs:
Outputs:
- Test bed system for the efficient selection and retrieval of datasets of interest from data centres.
- Feedback on prototype system from test users.
- Inputs of solar physics requirements to WP-A1.
- Inputs of implementation issues to WP-A2, WP-A3, WP-A4 and WP-A9.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.4.0 Design and Requirements
- 5.4.1 Solicit user requirements from solar community
- 5.4.2 Construct temporary catalogues to use in testbed
- 5.4.3 Define method of retrieving data from sites
- 5.4.4 Select tools to use for UI
- 5.4.5 Construct selection tools to interface to user interface
- 5.4.6 Integrate test bed system
- 5.4.7 Evaluate test bed implementation
Example science case:
A scientist is studying a particular X-class solar flare. S/he wishes to identify and explore the data available which cover its site during the 24 hours leading up to the event. GOES X-ray flux data can be used to identify the timing of the event but the location is often very approximate and will be taken from the reported position of the associated active region. Catalogue data for all observations taken during the required time-period can be used to refine the flare location. This stage may also require access to original data (possibly in the form of quick-look images made on-the-fly) rather than relying on metadata. Due account will need to be taken of solar rotation during the 24 hrs. With accurate time and location parameters, the search for supporting data will be refined. The morphological and photometric history of the region will be
investigated using imaging data and its plasma properties can be characterized from spectroscopic data found to match the event. Magnetogram data will be needed in a form suitable for automatic spatial- and temporal-registration with any monochromatic images available from a variety of space and ground-based sources.
This pilot is principally a test bed for improving methods of data selection in solar physics. This is important because it is not typical in solar physics for all data from a particular experiment to be automatically pipeline-reduced all the way to science-quality products. What is more common is that summary data products are made available, together with catalogues listing observational parameters, which the user may then interrogate, to select datasets containing observations of solar features of the desired sort, and s/he can then request a copy of the processed data from the primary archive for that experiment. The main goal here, therefore, is to develop mechanisms for making it easier for the user to select which datasets are of interest, since this is the stumbling block to doing science with the data. This pilot involves much more interactive work than the others, so it does more to address on-the-fly federation of datasets than the others, which are principally implementing static federations, and much of its work is undertaken in collaboration with the European Grid of Solar Observations (
EGSO).
The crux of the
Design and Requirements study for this pilot (Task 5.4.0) was the definition of the minimal set of parameters required for solar observing catalogues. This resulted in a document issued under the aegis of
EGSO and entitled
"EGSO Unified Observing Catalogues". This was constructed from the perspective of catalogue searches, collating the set of parameters that a user might need to query in order to select a particular dataset, but bearing in mind that the set defined must work within the context of the standard
SolarSoft software package ubiquitous within the solar physics community. In addition to this minimal parameter set, general information about the observatory (instrument description, contact info, etc) should be available to the user: it was decided that the definition of the requirements for such ancillary data should be left to
EGSO, since this information is unlikely to be interrogated via the kind of catalogue queries being prototyped in
WP-A5.4.
Another design issue was the format for storing the solar observing catalogue data. Currently, this is usually held within an
IDL database, for ease of manipulation using
SolarSoft, but there is a desire to remove the reliance on
IDL, to ease the integration of solar physics software and Grid middleware. Another possibility would be to store these data in an XML repository, such as
Xindice, possibly using
VOTable documents for each entry. This would require the addition of further Unified Column Descriptors (
UCDs) relevant to solar physics: this avenue is being explored within the context of
EGSO, and it was decided that the time constraints on the delivery of working prototype software within
WP-A5.4 would necessitate the use of an
IDL-based system for this pilot.
The solicitation of user requirements (Task 5.4.1) for this pilot included a wide-ranging questionnaire conducted in conjunction with the STP pilot of
WP-A5.5 and ESA's
SpaceGRID initiative, addressed to a wide cross-section of the international solar system research community and eliciting more than one hundred responses. The details of these responses are beyond the scope of this document - a
summary is provided on the
WP-A5 PilotDocs Wiki page - but a few points from it are worth noting here. This exercise produced some quite explicit performance requirements, not specified so concretely yet elsewhere within
AstroGrid, for example: the system should provide feedback on an action within 30s; simple, online tasks should be completed within an average time of a minute; and complex, offline tasks should be completed within an average time of 24 hours. Interestingly, this survey also identified some Intellectual Property Rights issues not discussed much within
AstroGrid: some respondents thought that there should be a possibility of keeping workflows and query results within
AstroGrid's
MySpace.
Once the minimal set of parameters had been defined, the construction of temporary catalogues to be used in the testbed (Task 5.4.2) could begin. This turned out to be quite a time-consuming business, as it required the merging of pointing and instrument data files in some cases. It was decided to concentrate on federating data from
TRACE and
Yohkoh-SXT, with
SOHO-CDS data as a lower priority task:
TRACE and
Yohkoh-SXT are both imaging instruments, while
SOHO-CDS is a spectrometer, so inclusion of its data would necessitate the definition of further metadata, which was thought to be difficult, given the time available for completion of the pilot. Similar considerations led to the decision (within the context of Task 5.4.3,
Define methods of retrieving data from sites) to concentrate in the first instance on data on-line at MSSL, rather than near-line data at RAL. Finally, these time limitations necessitated the use of existing
IDL tools for the construction of the pilot's user interface and the selection tools to interface to it, rather than the creation of new tools, as originally envisaged as a possibility (Tasks 5.4.4 and 5.4.5).
The integration of the testbed system (Task 5.4.6) within
IDL/
SolarSoft was completed successfully, and the series of screenshots below follows the course of a query using it.
Step 1
The image below shows the start of the selection process. The right hand pane shows a
SOHO-EIT EUV image, marking known features. The left hand pane displays information from three different satellites. At the bottom is the time series of
GOES X-ray flux measurements in a number of bands. In the middle, orange crosses mark
Yohkoh-SXT observations made in various modes during the same time period, while white crosses in the top portion mark
TRACE EUV observations in various filters during that interval.
Step 2
The user notes that the
GOES X-ray data are exhibiting a solar flare at about 14:00. Using the mouse, the user can then define a time interval in the left hand panel (shown by the dashed lines), and the image in the right hand panel is then overplotted with the fields of view (FOVs) of all observations (
TRACE in white,
SXT in orange) taken during that interval.
Step 3
The user can then zoom in on the region of interest, by selecting an area using the mouse: the selected region is indicated by the dashed square on the EIT image in the right hand pane.
Step 4
Once that region has been defined, the right hand pane is replotted, showing the FOVs of only those observations whose FOVs intersect with the selected region. The lower portion of the right hand window then reports the temporal and spatial selection criteria the user defined, as well as the number of images from each instrument that these have selected.
At the moment, the user then has to go away to the normal UI for the respective databases and extract the desired data via inputting these selection criteria, but it is intended that this pilot will be extended so that the selection can be made directly from this UI. One slight complication is that this current selection procedure can return a large amount of data - 285
TRACE images in the example above - and it would be desirable to add some additional cadence criteria to the selection (e.g. only extract an image every five minutes, say).
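The selection in Steps 2 to 4, together with the suggested cadence criterion, could be sketched roughly as follows. The pilot itself was built with IDL/SolarSoft tools; this Python fragment is only an illustration, the record layout and toy catalogue are hypothetical, and no correction for solar rotation is attempted.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Hypothetical catalogue entry: time in seconds, FOV as a rectangle."""
    instrument: str
    time: float
    x0: float
    y0: float
    x1: float
    y1: float

def overlaps(obs, region):
    """True if the observation's FOV rectangle intersects the selected region."""
    rx0, ry0, rx1, ry1 = region
    return obs.x0 <= rx1 and obs.x1 >= rx0 and obs.y0 <= ry1 and obs.y1 >= ry0

def select(catalogue, t_start, t_end, region):
    """Apply the temporal and spatial selection criteria."""
    return [o for o in catalogue
            if t_start <= o.time <= t_end and overlaps(o, region)]

def thin_by_cadence(observations, min_gap):
    """Per instrument, keep an image only if min_gap seconds have passed since the last kept one."""
    kept, last = [], {}
    for o in sorted(observations, key=lambda o: o.time):
        if o.instrument not in last or o.time - last[o.instrument] >= min_gap:
            kept.append(o)
            last[o.instrument] = o.time
    return kept

# Toy catalogue: one TRACE image per minute for 20 minutes, plus one SXT
# image whose field of view lies well away from the selected region.
catalogue = [Observation("TRACE", 60.0 * i, 0, 0, 100, 100) for i in range(21)]
catalogue.append(Observation("SXT", 300.0, 500, 500, 600, 600))

selected = select(catalogue, 0.0, 1200.0, region=(50, 50, 150, 150))
thinned = thin_by_cadence(selected, min_gap=300.0)  # one image every five minutes
```

Counting `selected` before any data are fetched also gives the user the number of images per instrument that the criteria would return, as reported in the lower portion of the right hand window.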
User response:
User comments on the solar pilot User Interface were
generally favourable, but people wanted more than had been
included in the Pilot once they started to realize what might be
possible. The variety of suggestions confirmed what had
already been concluded - that a single User Interface would never
suit all users and that some means of configuring the interface
on a user-by-user basis may be required. Further thought will be required, to assess the extent to which this can be delivered by
AstroGrid.
Lessons learnt:
(1) Preparation of commonly-used data:
Having a responsive user interface means that common inputs must
have been prepared ahead of time. For example, because the
pointing information is not included in Yohkoh-SXT observing
catalogues, this currently has to be inserted as the catalogues
are read in. This can take a long time and it was found to be the main
cause of the sluggish response of the initial version of the
interface. Experimenting with a metadata form of the catalogue
produced ahead of time largely eliminated this latency.
N.B.: This may work well for the observing catalogues, but it is
almost impossible to have all possible inputs prepared ahead of
time, especially some of the complex ones.
(2) Further data selection constraints required:
The interface required the addition of more detailed selection
criteria in order to reduce the number of "frames" of data that
were to be processed. In essence, it was too easy to select
multiple datasets and some constraints were necessary to reduce
the data request to something manageable. For test purposes a
filter of one-in-three was applied, and the time interval that
could be requested was limited. This will shortly be refined by
an additional selection window that will allow the user to
reject images with low resolution, or a large number of dropouts,
and to define a desired output cadence.
(3) Resource estimation:
A tool to predict the output size of the requested dataset is
required to ensure that unreasonable requests can be intercepted
and refinements requested.
(4) Caching of requests:
A way of storing the request is required so that the user can
return to it at a later time. This allows the user to re-run the
request (e.g. if problems were encountered), add additional data,
or refine the existing selection. During the testing phase, it
was very irritating to have to repeatedly re-enter complex
requests into the User Interface. An XML-based (self-describing)
file might be the best choice here.
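As an illustration of lesson (4), a request could be written to and read back from a small XML document along the following lines. The element and attribute names, and the date used in the example, are invented for this sketch and do not correspond to any agreed AstroGrid or EGSO schema.

```python
import xml.etree.ElementTree as ET

def request_to_xml(request):
    """Serialise a selection request (a plain dict) to an XML string."""
    root = ET.Element("request")
    ET.SubElement(root, "interval", start=request["start"], end=request["end"])
    x0, y0, x1, y1 = request["region"]
    ET.SubElement(root, "region", x0=str(x0), y0=str(y0), x1=str(x1), y1=str(y1))
    for name in request["instruments"]:
        ET.SubElement(root, "instrument").text = name
    return ET.tostring(root, encoding="unicode")

def request_from_xml(text):
    """Rebuild the request dict from its XML form."""
    root = ET.fromstring(text)
    interval = root.find("interval")
    region = root.find("region")
    return {
        "start": interval.get("start"),
        "end": interval.get("end"),
        "region": tuple(float(region.get(k)) for k in ("x0", "y0", "x1", "y1")),
        "instruments": [e.text for e in root.findall("instrument")],
    }

# Hypothetical request: the date and coordinates are placeholders.
request = {
    "start": "2002-01-01T13:30:00",
    "end": "2002-01-01T14:30:00",
    "region": (100.0, -200.0, 400.0, 100.0),
    "instruments": ["TRACE", "Yohkoh-SXT"],
}
saved = request_to_xml(request)
restored = request_from_xml(saved)
```

Because the stored form is plain text, the user could edit it to add an instrument or tighten the interval before re-running the request, rather than re-entering it through the User Interface.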
In summary, the solar pilot of
WP-A5.4 has successfully met its two original
Objectives. It has (in conjunction with
EGSO) defined the requirements for the easy interrogation of summary information in solar physics, and it has deployed a test bed implementing these requirements to enable selection from a restricted set of datasets (in this case, imaging data from
TRACE and
Yohkoh-SXT).
WP-A5.5: STP pilot - Time-series data
Objectives:
- To identify and evaluate implementation options for the efficient query, manipulation and delivery of heterogeneous time series data.
- To implement a test bed system to assess the problems associated with the integration of legacy archives.
- To provide a simple web-based interface that can be used to demonstrate the end-to-end functionality of the test bed.
Inputs:
- Candidate archives - UKCDC and WDC.
- Existing middleware software/libraries and standards - STPDF, XDF, XSIL, CDFML etc.
- Inputs from technology work packages, particularly WP-A2.
Outputs:
- Time series federation test bed.
- Feedback on prototype system from test users.
- Inputs of STP requirements to WP-A1.
- Inputs of implementation issues to WP-A2, WP-A3, WP-A4 and WP-A9.
- Inputs to the development of the Phase B plan.
Tasks:
- 5.5.0 Design and Requirements
- 5.5.1 Develop metadata translation layer
- 5.5.2 Develop a data export layer
- 5.5.3 Implement a simple query layer
- 5.5.4 Implement an authorisation layer
- 5.5.5 Integrate system with Grid middleware
- 5.5.6 Develop a simple web-based UI
- 5.5.7 Evaluate test-bed implementation
Example science case:
A scientist wishes to study the propagation and effect of a coronal
mass ejection. This requires use of: (i) the coronagraph on SOHO; (ii)
upstream solar wind measurements from ACE; (iii) Cluster plasma and field measurements near the magnetopause; (iv) plasma composition measurements in the mid altitude cusp; (v) ring current enhancements, in situ, remote sampling and ground-based geomagnetic indices; (vi) position and timing information. These data sets range from simple scalar time series data, to sequences of images and higher dimensionality arrays. They currently have different locations, query specifications and are returned in different formats. The data may need to be transformed into a consistent co-ordinate frame
or combined to produce ancillary products. A uniform, and flexible, metadata specification is therefore crucial to ensure that manipulation of data from different archives can be done in a consistent and correct way.
The aim of this pilot was to investigate the Grid-enabled
federation of heterogeneous time-series data. This is of particular relevance to Solar Terrestrial Physics (STP) data sets due to the large number of in situ, multi-point and remote sensing measurements made across a wide range of scales in both time and space. STP data sets are relatively small when compared to the other
AstroGrid domains. The main issues for the Grid infrastructure to address come from the complexity of the analysis and in particular the need to locate, search, extract, manipulate and combine multiple data sets. It is also important to consider the international perspective since many of the key datasets that will be required by UK STP scientists originate from non-UK instruments and facilities.
As discussed above, the
Design and Requirements phases of
WP-A5.4 and
WP-A5.5 shared input from the
SpaceGRID solar system research user requirements survey. The results of this survey influenced the design of the
WP-A5.5 top-level architecture, which is sketched below: the rectangles are programs, the ellipses data stores, the dotted lines demarcate abstract entities (i.e. the
Resource catalogue, an arbitrary
Data resource, the
Query handler and the
User interface) and the arrows indicate major data flows.
Where possible, this architecture was to be implemented using existing software, such as the Solar Terrestrial Physics Data Facility (
STPDF) system, and it was decided that the data sources to be used would be selected from those of the
UKCDC and
WDC, already on line at
RAL. These would comprise about 35 million time series records in total, of several types: from
UKCDC would come data from
ACE, as well as
GOES Key Parameters, while the
WDC would provide geomagnetic indices such as Dst and aa data. The
UKCDC data are held as
CDF files (one per day), while the
WDC data are stored as ASCII or binary tables. The strength of
STPDF is that it can provide a uniform view across this heterogeneous set of data resources.
The development of the metadata translation layer (Task 5.5.1) started with the assessment of a number of XML formats that might be
used in the pilot. A report on this (entitled
"XML for STP data") may be found on the
PilotDocs Wiki page, which may be summarised as follows:
XSIL is too simple;
XDF is too complicated;
CDFML is too specific; and
VOTable looks usable. Despite the fact that
VOTable was considered usable for STP work, it was decided to use an
ad hoc XML format for the pilot work,
for several reasons. Firstly, as with the consideration of use of
VOTable for solar work in
WP-A5.4, its use in a new area would require the definition of a new set of
UCDs, for which there was insufficient time in the pilot. Secondly, it would be easier to define the restricted DTD or schema in its own namespace, without having to implement the whole of
VOTable. Whilst the metadata representation used here was
ad hoc, it was based on the International Solar-Terrestrial Physics (
ISTP) guidelines already used within the CDF-based
datasets held by the
UKCDC.
For the
STPDF software to be able to pick up the metadata for the
WDC data, it was necessary to hand-craft text files, while the
UKCDC metadata could be read straight from their
CDF files.
N.B. SpaceGRID (as part of the [SPASE] collaboration led by
GSFC and also involving
CDPP,
Southwest Research Institute and
PDS) are defining a space physics query language, which will yield a data dictionary likely to become the default for use in STP: this effort is starting from low level terms (i.e. Dublin Core) and then building up domain-specific metadata within a namespace. The
WP-A5.5 pilot is constructing domain-specific descriptors based on the
SpaceGRID work, and it is suggested that these would have to be included in any more general
AstroGrid ontology via an STP namespace.
The development of the data export layer (Task 5.5.2) was simplified by the decision to use
STPDF, since that would deliver the required functionality. Users are provided with three classes of output:
- (i) metadata descriptions of the required data in XML: STPDF writes this information out to CDF files, which are then converted to XML using standard tools;
- (ii) listing of time intervals resulting from a query, again translated into XML from ASCII produced by STPDF and then post processed using software developed during the pilot;
- (iii) the selected data themselves, in CDF format.
The different datasets could be stored in different formats, but data are exported from the pilot using
STPDF, which can generate the required
CDF files on-the-fly.
A simple query layer (Task 5.5.3) was defined, based on a set of four query types:
typeset - what type of data;
time - of observation;
location - where the instrument was; and
target - where the target of the observation was. This is accessed via a query interface
which allows selection of system or user datasets (i.e. the results of previous operations, thereby allowing compound queries to be built). The UI is dynamically updated to show available fields, and the number of records for each selected dataset. Four basic operations are supported: select/output fields; query data set; find/select time interval; and time series join (nearest neighbour).
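The "time series join (nearest neighbour)" operation can be pictured with the following sketch, which pairs each record of one series with the record closest in time from another. It is not the STPDF implementation; the tolerance parameter and the sample values are assumptions made for illustration.

```python
from bisect import bisect_left

def nearest_neighbour_join(left, right, tolerance):
    """Pair each (time, value) in `left` with the closest-in-time record in `right`.

    Both inputs must be sorted by time. Records with no neighbour within
    `tolerance` are paired with None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, value in left:
        i = bisect_left(right_times, t)
        # The nearest record is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right)]
        best = min(candidates, key=lambda j: abs(right_times[j] - t), default=None)
        if best is not None and abs(right_times[best] - t) <= tolerance:
            joined.append((t, value, right[best][1]))
        else:
            joined.append((t, value, None))
    return joined

# Made-up samples: an hourly index and an irregularly sampled field magnitude.
index_series = [(0, -12), (3600, -45), (7200, -80)]
field_series = [(10, 5.1), (3500, 9.8), (3650, 12.3), (20000, 4.0)]
joined = nearest_neighbour_join(index_series, field_series, tolerance=600)
```

With sorted inputs, the binary search keeps the cost of each lookup logarithmic in the length of the second series, which matters when joining series with millions of records.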
It was decided not to implement an authorisation layer (Task 5.5.4),
but rather have the pilot only handle public data. Task 5.5.5 (
Integrate system with Grid middleware) was also not really tackled, since the pilot does not use Grid middleware, as such, based as it is on
STPDF. It was hoped that some testing of the use of Globus/GridFTP would be possible, but, in the event, the resources available for the pilot were insufficient to attempt that. Instead the GNU wget tool was used to transfer data from a public ftp area on the query server for visualisation on the quicklook server. However,
SpaceGRID is intending to put web services or OGSA wrappers around what has been produced in this pilot: Science Systems Ltd are leading the development of the core
SpaceGRID services, with the
RAL team providing the domain expertise and supplying the data access services to several legacy archives.
The full pilot system looks like this:
and involves three data servers at
RAL:
- The AstroGrid Server is a Sun workstation, with STPDF installed on it. The local reference directory and data catalogues provide information stored on the local server. The master reference directory provides information about the location of network-accessible resources. In the case of the pilot work, this contains entries for the UKCDC and WDC servers.
- The WDC server provides access to its data holdings via the STPDF server. The STPDF system handles the translation from the underlying data format and the sub-setting of the data.
- The UKCDC server provides access to its holdings in the same way as those of the WDC server, described above.
A web-based query is dynamically generated from information extracted from the STPDF system, plus additional metadata not available from it. The query builder generates an XML query file that is passed to the query translator, which handles the interface to the STPDF system. STPDF then handles the requests for data from each of the required archives and returns a result. That is formatted and returned to the user through the web server.
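As an illustration of the query-builder step, the sketch below assembles an XML query document from the four query types described earlier. The report does not give the pilot's actual XML schema, so every element and attribute name here (`query`, `dataset`, `typeset`, `time`, `location`, `target`, `output`, `field`) is a hypothetical stand-in.

```python
import xml.etree.ElementTree as ET

def build_query(datasets, typeset, time_start, time_end,
                fields, location=None, target=None):
    """Return an XML query string (hypothetical schema, for illustration)."""
    query = ET.Element("query")
    for name in datasets:
        # A dataset may be a system dataset or the result of a previous query.
        ET.SubElement(query, "dataset", name=name)
    ET.SubElement(query, "typeset").text = typeset
    ET.SubElement(query, "time", start=time_start, end=time_end)
    if location is not None:
        ET.SubElement(query, "location").text = location
    if target is not None:
        ET.SubElement(query, "target").text = target
    output = ET.SubElement(query, "output")
    for field in fields:
        ET.SubElement(output, "field").text = field
    return ET.tostring(query, encoding="unicode")
```

A query translator would then parse such a document and map each element onto the corresponding call to the underlying data system; that mapping is specific to STPDF and is not sketched here.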
The final task (Task 5.5.6) comprises the development of a simple WWW-based Quicklook UI to view the selected dataset(s).
The following set of screenshots illustrates the operation of the prototype UI (based on the
UKCDC UI), from the specification of the query (shown above).
A Quicklook User Interface can be launched, with which the user can upload data into the Quicklook system, either by fetching a result from the query server or by uploading a local file. The file, or just the metadata, can be displayed as XML or loaded into the quicklook plotting system. A second UI allows the user to select a time range subset of the dataset and choose which parameters to plot.
The example plot below shows a joined dataset comprising a year's worth of Dst data from the WDC and ACE magnetic field data from the UKCDC.
The user can then zoom in on a portion of the plot. The example below shows such a zoom, of a magnetic storm in early May 1998, while below that is a screenshot showing an ASCII dump of the same data.
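The quicklook operations described above (selecting a time range, choosing parameters and dumping the result as ASCII) amount to a simple filter and projection over time-stamped records. The following is a rough sketch under assumed conventions: records are dictionaries with an ISO 8601 `time` key, and the function names are invented for the example rather than taken from the pilot software.

```python
from datetime import datetime

def select_subset(records, start, end, parameters):
    """Keep records with start <= time <= end, and only the named parameters."""
    t0 = datetime.fromisoformat(start)
    t1 = datetime.fromisoformat(end)
    subset = []
    for rec in records:
        t = datetime.fromisoformat(rec["time"])
        if t0 <= t <= t1:
            row = {"time": rec["time"]}
            for name in parameters:
                row[name] = rec.get(name)  # None if the parameter is missing
            subset.append(row)
    return subset

def ascii_dump(rows, parameters):
    """Format rows as flat, space-separated ASCII with a header line."""
    lines = ["time " + " ".join(parameters)]
    for row in rows:
        lines.append(" ".join([row["time"]] + [str(row[p]) for p in parameters]))
    return "\n".join(lines)
```

Zooming in on a plot corresponds to calling `select_subset` again with a narrower time range on the rows already loaded.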
User response:
The system has currently only been tested by local users. The
ability to apply queries to long periods of data and to generate
compound queries involving multiple datasets was of particular
interest. Even with the limited number of datasets currently
accessible the system was considered to provide a useful aid
for event identification. The user interface was found to be
somewhat confusing and might be better spread over a number of
separate pages for the different functions. It was also rather
easy to generate queries that were beyond the capabilities of
the pilot system and that would time out before completion. A clear
mechanism for restricting the time range over which a query
is applied would assist in constraining queries. It would be
useful if the quicklook system could produce XML output for
the selected period; currently only flat ASCII is offered.
Lessons learnt:
Having an end-to-end testbed on which to try out different
implementation options has proved extremely useful and the
pilot system will continue to be used for future testing,
including development of web and grid services.
It is clear that metadata handling remains one of the biggest
challenges when dealing with heterogeneous data resources.
Although the pilot started to investigate these issues as part
of the design phase, when it came to implementation, certain compromises had to be made. These resulted both from limitations
in the metadata available for some of the data sets and in the
capabilities of the STPDF toolkit. In order to achieve the goal
of automated data manipulation (e.g. units conversion and
coordinate transformation), further standardisation of the
metadata is required. This will be particularly important for derived
products, where the metadata will need to be generated on the
fly. The pilot also showed that it was very easy to
generate queries that were beyond the capabilities of the
system. Query estimators will help but as the complexity
of the queries increases it becomes more difficult to accurately
assess the time and resources required. The tests showed that
the query specification interface should have default values
that restrict the size of a query. Secondly, it is important to
have good job control and monitoring that allows the progress
of a large query to be examined by the user and, if necessary,
allows the clean termination of the request.
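The three safeguards discussed above (restrictive defaults, a query estimator, and job control with clean termination) might look something like the sketch below. It is illustrative only: the 31-day default, the assumption of a regularly sampled dataset and all function names are hypothetical, and a real estimator for compound queries would be considerably harder to write, as noted above.

```python
from datetime import datetime, timedelta

DEFAULT_MAX_SPAN = timedelta(days=31)  # hypothetical default limit

def clamp_time_range(start, end=None, max_span=DEFAULT_MAX_SPAN):
    """Limit a query's time range; a missing end defaults to start + max_span."""
    if end is None or end - start > max_span:
        end = start + max_span
    return start, end

def estimate_records(start, end, cadence_seconds):
    """Rough record count for a regularly sampled dataset (an assumption)."""
    return int((end - start).total_seconds() // cadence_seconds) + 1

def run_in_chunks(start, end, chunk, process, cancelled):
    """Process [start, end) piece by piece, checking for cancellation
    between pieces. Returns (fraction_completed, finished)."""
    total = (end - start).total_seconds()
    t = start
    while t < end:
        if cancelled():
            # Stop between chunks so no piece is left half-processed.
            return (t - start).total_seconds() / total, False
        nxt = min(t + chunk, end)
        process(t, nxt)
        t = nxt
    return 1.0, True
```

Splitting a long query into chunks gives natural points at which to report progress to the user and to honour a termination request without leaving partial results behind.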
In summary, the STP pilot of WP-A5.5 has met all three of its original objectives: it has identified and evaluated implementation options for the efficient querying, manipulation and delivery of heterogeneous time series data; it has implemented a test bed system to assess the problems associated with the integration of legacy archives; and it has provided a simple web-based interface that can be used to demonstrate the end-to-end functionality of the test bed system.
Future Plans
Work on all five WP-A5 pilots will continue to the close of the extended Phase A, at the end of December 2002. For the two that were suspended - WP-A5.1 and WP-A5.2 - this will entail the completion of the original set of tasks, while, for the remaining three, the work will centre on the production of web service interfaces to the prototype systems they have developed, so that these services can be used as testbeds within AstroGrid's trial data-grid.
Appendix
An Appendix will be added to this document at a later date, to record the results of the work described in
Future Plans above, undertaken during Q5 of Phase A, ending in December 2002.
--
BobMann - 30 Sep 2002