r3 - 19 Apr 2002 - 15:53:16 - ClivePage

WP-A4: Database Technology

The original definition of this work package is:

Objectives

To establish what is required of the DBMS technology which will underpin the virtual observatory.

To evaluate a range of DBMS solutions for their suitability.

Inputs

Experience of relational and OO DBMS at AstroGrid sites and others.

Overall science requirements and use-cases from WP-A1.

The existing structure of a range of data archive sites in the UK and elsewhere.

Existing ad-hoc standards for astronomical metadata and for astronomical server inter-working such as FITS, ASU, and GLU.

Software systems available from other disciplines.

Inputs from other work packages.

Tasks

Given the overall science requirements from WP-A1, establish the detailed requirements for a DBMS in terms of data storage, management, and querying, and external interfaces, taking account of the need to provide scalable data mining facilities on multi-processor systems.

Develop a few simple benchmark problems from use-cases based on existing astronomical datasets, such as large catalogues, for these evaluations.

At the data storage level: study the options for storage of astronomical data in various formats including FITS and XML. The problems to be studied include efficient access to binary files, sky tessellation, multi-dimensional indexing, the preservation of legacy data, and the generation of homogeneous metadata.

Participate in international efforts, with our partners in the AVO and NVO projects, to define standards for astronomical metadata. This is a joint activity with WP-A2 and WP-A5.

Evaluate a number of DBMS of various types, including relational, object-oriented and object-relational. Examine and where appropriate evaluate GIS, statistical packages, data warehouse solutions, and XML-based DBMS.

Evaluate the INFEO system, developed for searching distributed Earth Observation catalogues, and the Isite tool, used in the NERC Metadata Project and supported by NASA, with a view to determining whether parts of them are suitable for use in astronomy.

Evaluate a short-list of solutions in parallel hardware environments such as SMP and Beowulf clusters. Problems to be tackled include federation of datasets over the wide area network, the suitability of SQL or OQL for astronomical queries, and the handling of metadata.

Examine and evaluate middleware solutions for the layer between the astronomically oriented user interface and the standards-based DBMS. This will be done in collaboration with WP-A2.

Solutions based on SOAP and Java should be included.

Outputs

Produce a document listing the virtual observatory database requirements.

Report on the options, benchmark results, and recommendations for DBMS technology to be installed at AstroGrid sites in Phase-B. It may be appropriate to have separate recommendations for solutions suitable for retro-fitting to existing or "legacy" sites and for those more suitable for "green-field" sites such as VISTA.

Report on software developments required during Phase-B, especially in areas such as data mining and data exploration (e.g. correlation, population modelling, outlier discovery). We have good links with statisticians and computer science groups in Belfast, Edinburgh, Pittsburgh, Penn State, and elsewhere, which should facilitate collaborative developments in this area.

The Database Technology work-package essentially covers any aspect of AstroGrid where there is a need for persistent storage of structured information. This means that the areas of interest have evolved considerably as the design of the overall architecture has developed.

Inputs

  • Science cases and use-cases from WP-A1.
  • Information from related projects especially AVO, US-NVO, and EGSO and various interoperability discussions and workshops.
  • Visits to other sites including Jodrell Bank and MSSL to get information on requirements in radio, solar, and STP areas.
  • Information from other grid projects, especially from liaison with the DBTF.
  • Information on the technology used in a number of existing astronomical data archives.

Main tasks

  • Interoperability: the federation of existing data archives depends on the development of standards for the interchange of all types of data. Fortunately all the VO projects see this as a high priority, and there has been general agreement that these can best be developed by adopting the middleware protocols forming part of the Web Services paradigm: XML, SOAP, WSDL, and associated standards. Members of the WP-A4 team have played an active part in the workshops and discussions which have led to the development of the VOTable proposal for the encapsulation of tabular results in XML. Further work is needed to extend this to cover homogeneous datasets such as images, and to develop standards for queries. We have also had useful contacts with EPCC where a group is working on methods of encapsulating binary data in XML.

  • Architecture: we have been involved in extensive discussions of the overall architecture. Although many details remain to be resolved, it is now generally agreed that the solution involves portals to provide the user interface, a resource discovery mechanism, and a data warehouse to support I/O-intensive or CPU-intensive processing such as data mining.

  • Resource register: it has gradually become clear that some form of central register of the data resources available to AstroGrid needs to be set up. Ideally this would be a joint resource for the world's VO projects. It is not yet known whether some existing system such as UDDI is suitable; if not, then an XML-based database management system might provide the required facilities. If the resources were all interfaced via WSDL, this register could be updated automatically.

  • Astronomical Data Warehouse: the facilities provided by many existing data archives are limited to answering queries about points or small patches of sky. The most valuable scientific results will arise from joint use of two or more large datasets, for example the cross-identification of sources at two or more wavelengths, and from statistical investigations and data mining operations on large datasets. This brings up a number of problems which are areas of current study:

    • Indexing the sky: joining tables on imprecise celestial coordinates, the so-called fuzzy join problem, is difficult (or at least extremely slow) when using standard business-oriented DBMS. Possible solutions include two-dimensional indexes such as the R-tree provided with (or available as add-ons to) DBMS, and mapping functions to one-dimensional indexes such as HTM and HEALPix. We plan to evaluate these.

    • Query languages: SQL is poorly matched to many types of astronomical query, and is best considered as an advanced API to a database. We may need to develop an astronomically oriented query language, which can be translated into SQL or XQuery.

    • Metadata preservation is vital in all astronomical data processing. Standards such as FITS and UCD have been developed to this end, but persuading DBMS to preserve metadata is not easy.

    • Using parallel hardware: it should be possible to spread many types of database and data mining operation over the nodes of a cluster (Beowulf or similar), but we need to gain experience of these and develop suitable algorithms. We expect to procure hardware shortly to begin work.

    • Managing workspace: users of the data warehouse will want to copy standard datasets from remote sites and generate their own new datasets. The MySpace concept has been devised to handle the latter. A database may be needed to manage these resources.

    • Evaluation of DBMS: we plan to evaluate a number of free and commercial DBMS using sample astronomical datasets. Plans for this are now at a fairly advanced stage.

    Plans for the astronomical data warehouse involve working closely with WP-A3 over data storage and computational resources, and with WP-A2 over grid software, which will provide its authentication and authorisation facilities, and over software such as GridFTP to copy large datasets efficiently.
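
To make the fuzzy join problem mentioned under "Indexing the sky" more concrete, the following is a minimal sketch in Python of a positional cross-match between two small catalogues. It is an illustration only, not one of the methods to be evaluated: it uses a simple bucketing by declination zone rather than an R-tree, HTM, or HEALPix, and the function names (angular_sep_deg, zone_of, fuzzy_join) and the (id, ra, dec) tuple layout are assumptions made for this example. Coordinates and the match radius are in degrees.

```python
import math
from collections import defaultdict


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec, zone_height):
    """Integer index of the declination zone containing dec (hypothetical helper)."""
    return int(math.floor((dec + 90.0) / zone_height))


def fuzzy_join(cat_a, cat_b, radius_deg):
    """Return (id_a, id_b, separation) for every pair closer than radius_deg.

    Each catalogue is an iterable of (id, ra, dec) tuples; this layout is an
    assumption of the sketch, not a proposed standard.
    """
    if radius_deg <= 0:
        raise ValueError("radius_deg must be positive")

    # Zones are as tall as the match radius, so any counterpart of a source
    # in zone z must lie in zone z-1, z, or z+1.
    zone_height = radius_deg
    zones = defaultdict(list)
    for src in cat_b:
        zones[zone_of(src[2], zone_height)].append(src)

    matches = []
    for id_a, ra_a, dec_a in cat_a:
        z = zone_of(dec_a, zone_height)
        for neighbour in (z - 1, z, z + 1):
            # No restriction in right ascension is applied here; every source
            # in the three zones is tested with the exact separation.
            for id_b, ra_b, dec_b in zones.get(neighbour, ()):
                sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b)
                if sep <= radius_deg:
                    matches.append((id_a, id_b, sep))
    return matches
```

Calling fuzzy_join(cat_a, cat_b, 2.0 / 3600.0) would pair sources lying within 2 arcseconds of each other. The sketch only narrows the search in declination, so it still scans a full ring of sky per source; a practical index would also need to limit the range in right ascension, which is where the two-dimensional and pixel-based schemes named above come in.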

Outputs

  • Status reports on DBMS activities generally.

  • Report on sky indexing methods.

  • List of test datasets, list of DBMS to be evaluated, and evaluation criteria for the DBMS evaluation exercise.

  • Assessment of DTI Database Task Force plans and initiatives.

  • Reports on each of the DBMS evaluations.

  • Proposals for an astronomical database query language.

  • Draft requirements for the resource registry, and an outline of possible technical solutions to be circulated around all VO projects.

-- Clive Page - 19 Apr 2002
