r8 - 19 Dec 2002 - 17:19:19 - AndyLawrence

PhaseAReport

(6) Interoperability Report

(6.1) Interoperability

In the Virtual Observatory (VO) context the word "Interoperability" is understood to cover all issues relating to the combination of the resources (databases, software, computational resources, etc.) needed to construct a comprehensive VO that appears as a seamless, coherent whole to the user. Interoperability therefore covers the definition and use of standards across a range of areas, some of which are specific to astronomy (e.g. standards regarding the representation of astronomical data in different archives) and some of which (e.g. data transport protocols) are common to all users of the rapidly-developing computational infrastructure of the Internet. The key to interoperability is the collaborative definition and widespread use of agreed standards, and, as described below, AstroGrid has followed this approach in both the astronomy-specific and generic infrastructure arenas during Phase A. AstroGrid members have played an active role in the definition of VO standards (such as VOTable (1)), and the project has adopted an architecture which reflects current best practice as regards the adoption of relevant standards defined under the aegis of bodies such as the W3C(2) and the GGF(3). The astronomy-specific work has been advanced through a series of meetings with colleagues from AVO(4), US-VO(5) and the new International Virtual Observatory Alliance (Strasbourg, Jan 2002; Garching, June 2002; Strasbourg, Harvard and Baltimore, October 2002), while the infrastructure work has been informed by participation in a number of meetings organised by the National eScience Centre (NeSC, (6)) and by attendance at GGF meetings.

(6.2) Computational Infrastructure Interoperability

As detailed in the Grid Technology section, the key to ensuring interoperability with developing computational infrastructure is the adoption of a service-based architecture, using web services where appropriate, and grid services, on the OGSA(7) model, where "statefulness" is necessary. AstroGrid is committed to this approach, which also seems to be favoured by the other VO projects (although their more relaxed timescales mean that they have not yet had to make firm technology choices), so that a picture is emerging of a VO mediated by the transfer of SOAP (Simple Object Access Protocol, (8)) messages.
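As a minimal illustration of what such a message might look like, the sketch below builds a SOAP 1.1 envelope using only the Python standard library. The operation name "coneSearch", its parameters and the "urn:example:vo" namespace are hypothetical and are not part of any agreed VO interface.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
VO_NS = "urn:example:vo"  # hypothetical namespace for the service payload


def build_soap_request(operation, params):
    """Wrap an operation and its parameters in a SOAP envelope (as a string)."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    op = ET.SubElement(body, f"{{{VO_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element
        child = ET.SubElement(op, f"{{{VO_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical request: names and values are for illustration only
message = build_soap_request("coneSearch", {"ra": 180.0, "dec": -30.5, "radius": 0.1})
print(message)
```

A stateless web service would need nothing beyond a message of this kind; a grid service on the OGSA model would additionally carry some handle to the state held on the server side.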

(6.3) Astronomy-specific Interoperability

Given accepted standards for interoperability across computational infrastructure, it behoves the VO community to agree the standards and protocols that are required to integrate astronomical resources into a VO using that infrastructure. Such standards are required in a number of areas, the most important of which are discussed below, namely:

  • Resource discovery and the resource registry
  • Data archive queries
  • Specifying compound VO operations via workflows
  • Results returned from VO operations
  • Metadata associated with datasets
  • Access to resources

Progress to date on defining them has been good, with one concrete success - VOTable, an XML standard for presenting tabular data in astronomy - and a widespread acceptance that standards must be developed collaboratively within the VO community, with the International Virtual Observatory Alliance (IVOA) as the authority endorsing such developments.

(6.3.1) Resource Discovery and the Resource Registry

VO queries may require a search of a number of data sources which may be located at many different archive sites, and the most general possible VO queries - such as "give me all information that is known about the object at position X" - implicitly assume the possibility of querying more data sources than the user knows about, so there is a clear need for some mechanism for locating all the data sources relevant to a given query. The number of such sources is unlikely to be very large, so it seems appropriate that this function be performed via interaction with an astronomical Resource Registry, rather than the use of something like a WWW search engine.

The need for an astronomical Resource Registry was recognised early on by the AstroGrid Project, and is now agreed by our partners in the IVOA. There seems a good prospect for agreement on a single but replicated Resource Registry which can be used by all VO projects in the world. The timescales of the other projects, however, appear to be more relaxed than ours, so this is an area in which AstroGrid probably needs to take a lead. There are a number of issues that have to be addressed in the design of such a registry.

Granularity:
A user may require information about a number of facets of a given data source before knowing whether it is relevant to a particular query, for example:

  • Name of service, URL, physical location, etc.
  • Web interfaces supported (CGI, ASU, SOAP, WSDL, ...)
  • Type of holding (source catalogues, images, spectra, photometry, observing logs, etc.)
  • Waveband (radio, IR, optical, UV, X-ray, etc.)
  • Sky coverage (see notes below)
  • Any access restrictions (e.g. by date of observation, maximum download volume)
  • Spatial resolution (of images) or typical positional error (source catalogues)
  • Sensitivity (e.g. limiting magnitude)
  • Data volume (table sizes, image sizes, etc.)
  • Export formats supported (e.g. FITS, VOTable, CSV, PNG, GIF, PS, PDF)

In principle, the registry could contain just the top-level URLs of each data archive site, and all this information could be obtained via repeated querying of the site's web services descriptions (assuming use of WSDL) by the user's portal. This may be inefficient, so we favour a richer Registry, whose entries may contain all the information listed above; however, there may be situations in which a multi-step interaction between the user's portal and the registry is preferred - see the discussion of Sky Coverage below.
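To make the "richer Registry" option concrete, the following sketch shows one way an entry carrying several of the facets listed above might be represented and filtered. The class, field names, archive names and URLs are all hypothetical; no registry schema has yet been agreed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegistryEntry:
    """Hypothetical registry record holding a subset of the facets listed above."""
    name: str
    url: str
    interfaces: List[str]
    holding_types: List[str]
    wavebands: List[str]
    sky_coverage: str
    export_formats: List[str] = field(default_factory=list)


def find_resources(registry, waveband, holding_type, export_format=None):
    """Return the entries matching a waveband, a holding type and (optionally) a format."""
    matches = []
    for entry in registry:
        if waveband not in entry.wavebands:
            continue
        if holding_type not in entry.holding_types:
            continue
        if export_format is not None and export_format not in entry.export_formats:
            continue
        matches.append(entry)
    return matches


# Invented entries for illustration only
REGISTRY = [
    RegistryEntry("Example Optical Survey", "http://optical.example.org/",
                  ["SOAP"], ["source catalogues", "images"], ["optical"],
                  "dec > -30 deg", ["FITS", "VOTable"]),
    RegistryEntry("Example X-ray Archive", "http://xray.example.org/",
                  ["CGI"], ["source catalogues", "spectra"], ["X-ray"],
                  "whole sky (sparse)", ["FITS"]),
]

for entry in find_resources(REGISTRY, "optical", "source catalogues", "VOTable"):
    print(entry.name, entry.url)
```

With all facets held in the registry, the portal can answer this kind of question in a single step, without contacting each archive in turn.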

The storage of this registry information will clearly require a DBMS of some type, but the data volume will be relatively small, and since the information will all be imported and exported in XML formats, perhaps an XML-based DBMS such as Xindice(9) would be suitable. The registry clearly needs to be a replicated resource, and it also needs to take account of the fact that many popular datasets are present at a number of points on the web, since this information may be important for a VO query optimiser which can decide which of a set of replica datasets is the best to use for a given query posed by a user at a particular location. The work of keeping a detailed resource registry up-to-date is significant, but if all the individual archives use SOAP for their interfaces and WSDL for their service descriptions, then it might be possible for a robot to update the registry at regular intervals, for example every 24 hours. Search engines use similar methods to maintain their indices based on considerably more heterogeneous collections of web pages, so this ought to be feasible.

Sky Coverage:
One aspect which needs more thought is whether it is possible to store detailed information about sky coverage in the Registry. For surveys the coverage limits are usually fairly simple shapes, limited by declination or galactic latitude, but many observatories will have data arising from a large number of individual pointings, and it is not yet clear what is the most efficient way of making this coverage information available.

This may be one situation in which it is more efficient to have a multi-step interaction between the user portal or the registry and the data archive, rather than having all the relevant information stored in the registry. For example, for something like the HST archive, it may make more sense for the Registry sky coverage entry to be "whole sky (sparse)" with a procedure for querying the observing catalogue in more detail, than to have the Registry entry include a list of thousands of WFPC2 field centres, each a couple of arcminutes in size. Whatever the location of the sky coverage information, it is likely that it will have to be expressed in some hierarchical format, since it is required on a wide range of scales - from, say, a whole hemisphere (in the case of an astronomer wanting to find an optical sky survey catalogue suitable to use in finding counterparts for a northern sky radio survey) to a fraction of an arcsec (for a user wanting target positions for a spectroscopy proposal on an 8m telescope).
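The multi-step interaction described above can be sketched as a two-level lookup: the registry holds only a coarse declination range plus a "sparse" flag, and refers the portal to the archive when finer detail is needed. This is an illustrative simplification under assumed names, not a proposed coverage format; a real hierarchy would need more than two levels.

```python
from dataclasses import dataclass


@dataclass
class CoarseCoverage:
    """Top level of the hierarchy: what the registry itself might hold."""
    dec_min_deg: float
    dec_max_deg: float
    sparse: bool = False  # True for archives built from many small pointings


def registry_coverage_check(coverage, dec_deg):
    """Return 'no', 'yes' or 'ask archive' for a target declination."""
    if not coverage.dec_min_deg <= dec_deg <= coverage.dec_max_deg:
        return "no"           # outside the coarse limits: archive can be skipped
    if coverage.sparse:
        return "ask archive"  # registry cannot decide: query the observing catalogue
    return "yes"              # contiguous survey: the coarse limits are sufficient


# Hypothetical entries: a northern survey and a pointed-observation archive
northern_survey = CoarseCoverage(dec_min_deg=0.0, dec_max_deg=90.0)
pointed_archive = CoarseCoverage(dec_min_deg=-90.0, dec_max_deg=90.0, sparse=True)

print(registry_coverage_check(northern_survey, 41.3))
print(registry_coverage_check(pointed_archive, 41.3))
```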

Possible technical solutions:

  • UDDI: The commercial world's solution to the problem of registering web services is UDDI (Universal Description, Discovery and Integration, (10)), an initiative led by IBM, Microsoft and SAP and advanced within the W3C framework. Whilst it solves a somewhat similar problem to that motivating the astronomical resource registry, UDDI (at least in its current form) seems unlikely to be of use to the VO community. This is primarily due to its commercial roots, which mean that it contains hard-wired categories with no relevance to astronomy - for example, "yellow pages" classifications using standard business taxonomies, like the North American Industry Classification System (NAICS).

  • AstroGLU(11): This is a software package created by CDS (Strasbourg) and subsequently used there and by NASA/GSFC. It contains a resource directory with much of the required detail, using the GLU (Générateur de Liens Uniformes, (12)) system for symbolic service names instead of hard-coded URLs. In its current form, it has to be maintained by hand, and uses CGI interfaces, so an upgrade to handle XML-based standards and automated updating would be required; it is thought that a web service upgrade to GLU is in preparation.

  • RDF: The Resource Description Framework(13) is a W3C recommendation for structured metadata and may form a good basis for future work, but as yet the work seems not to have progressed as far as having a database of RDF information.

These and other possibilities will be pursued further in Phase B, in collaboration with our IVOA partners.

(6.3.2) Data archive queries

Existing data archives provide many common features, typically using a CGI-based query mechanism. The AstroGLU system from Strasbourg has a translator from a uniform set of CGI parameters into the terms needed by a number of major archives, but this is a work-around rather than a standard and is not very scalable. There now seems to be general agreement that we need to move towards an XML/SOAP/WSDL interface which can be implemented at each archive site. The US-VO has put forward an interim standard for a cone-search (14) web service (finding all sources in a cone of small angle around a given celestial position) and a number of prototypes have recently been set up. AstroGrid is in the process of setting up compatible services at Cambridge, Edinburgh and Leicester.
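A cone-search request of this kind reduces to three numbers: a position and a search radius. The sketch below builds such a request as an HTTP GET URL. The parameter names RA, DEC and SR (all in decimal degrees) follow the cone-search convention as we understand it and should be checked against the specification (14); the base URL is hypothetical.

```python
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a cone-search request URL for a position and search radius in degrees."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("right ascension must lie in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("declination must lie in [-90, +90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to any query string already present in the service's base URL
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


# Hypothetical service address, for illustration only
print(cone_search_url("http://archive.example.org/cone", 10.5, -20.0, 0.1))
```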

Less progress has been made on a standard for more advanced queries. Most if not all data archives use a DBMS which speaks SQL, but the SQL standard is both inadequate and poorly implemented in practice. Nearly all of the queries used in our DBMS evaluations (covered in the Database Technology and Data Mining report) had to be modified to suit the different DBMS under test. In addition, functions that are going to be frequently used, such as that for great-circle distance between points on the celestial sphere, need to be encapsulated in the query language, and the implementation of user-defined functions is very DBMS-specific. There is widespread agreement that some form of Astronomical Query Language (AQL) will be needed, with translators for the several SQL implementations used in archives around the world, but no standard has yet been developed. More work is needed on this in Phase B. Fundamentally, what is required is a standardised data-selection service, that provides a translation between the peculiarities of individual databases (both different DBMSs and different instances of them in different data centres, with different schemas) and some Grid-friendly format.
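Two of the points above can be illustrated briefly. The first function computes the great-circle separation between two sky positions using the haversine formula, the kind of function that would need to be built into any AQL. The second shows why translators are needed: even a simple row limit is written differently in different SQL dialects. The function names and the "top"/"limit" dialect labels are assumptions made for this sketch, which is not a proposal for the AQL itself.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    # Clamp to 1.0 to guard against rounding error for antipodal points
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def to_sql(table, columns, limit, dialect):
    """Translate one abstract 'first N rows' query into a DBMS-specific form."""
    cols = ", ".join(columns)
    if dialect == "top":    # syntax used by, e.g., Microsoft SQL Server
        return f"SELECT TOP {limit} {cols} FROM {table}"
    if dialect == "limit":  # syntax used by, e.g., PostgreSQL and MySQL
        return f"SELECT {cols} FROM {table} LIMIT {limit}"
    raise ValueError(f"unknown dialect: {dialect}")


print(angular_separation_deg(0.0, 0.0, 90.0, 0.0))
print(to_sql("sources", ["ra", "dec"], 10, "top"))
print(to_sql("sources", ["ra", "dec"], 10, "limit"))
```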

(6.3.3) Specifying compound VO operations via workflows

Querying a database is only one class of VO operation. If the VO maxim of "ship the results, not the data" is to be followed, then it must be possible to construct compound operations to be run within the VO; for example, a query on a set of databases, followed by running some analysis algorithm on the set of results, after they have been shipped to a common location. The construction of "workflows" like this is a common requirement within e-science, and in computing generally, so it is likely that AstroGrid (and the VO) will not define the necessary interoperability standard in this case, but rather adopt an existing standard which can be made to work for the particular case of astronomical operations. That said, initial experiments are being conducted within AstroGrid to use XML fragments in SOAP messages to provide inputs for complex services - for example, supplying sets of input parameters with which to run some piece of code - to start to assess the workflow functionality required for the VO.
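The compound operation used as an example above (query several databases, gather the results, run an analysis) can be sketched as an ordered list of steps, each consuming the output of the previous one. Everything here is hypothetical: the catalogues are in-memory stand-ins for remote archives, and the runner is a toy, not a candidate workflow standard.

```python
# In-memory stand-ins for two remote archives (invented values)
CATALOGUE_A = [{"id": "A1", "mag": 17.2}, {"id": "A2", "mag": 19.5}]
CATALOGUE_B = [{"id": "B1", "mag": 16.8}]


def query_archives(_):
    """Step 1: run the same selection (sources brighter than mag 18) at each archive."""
    return [[row for row in cat if row["mag"] < 18.0]
            for cat in (CATALOGUE_A, CATALOGUE_B)]


def gather(result_sets):
    """Step 2: ship the (small) result sets to a common location and merge them."""
    return [row for result_set in result_sets for row in result_set]


def mean_magnitude(rows):
    """Step 3: run an analysis on the combined results."""
    return sum(row["mag"] for row in rows) / len(rows)


def run_workflow(steps, initial=None):
    """Execute the named steps in order, feeding each the previous step's output."""
    result = initial
    for name, step in steps:
        result = step(result)
    return result


WORKFLOW = [("query", query_archives), ("gather", gather), ("analyse", mean_magnitude)]
print(run_workflow(WORKFLOW))
```

Only the selected rows move between steps, which is the point of the "ship the results, not the data" maxim.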

(6.3.4) Results returned from VO operations

The most obvious and serious weakness of existing web-based systems of querying multiple data archives has been the heterogeneous nature of the results which are sent back. The VO community therefore recognised the need for the standardisation of results formats at an early stage, and this process was launched with the definition of an XML standard for tabular data, called VOTable, which has now been endorsed by the IVOA. AstroGrid members were fully involved in this activity, which was undertaken through a round of email debates, followed by a meeting in Strasbourg in January 2002 to hammer out the final details required for release of the specification for VOTable version 1.0. Clearly, there is a need to repeat this procedure to produce agreed standards for the other data formats required in the VO (e.g. pixel data) and initial discussions about these are starting within the VO community.

The use of XML as the basis for VOTable was motivated by several considerations. Firstly, XML is the lingua franca of the web services world, so its use would aid the interoperability of the VO and external computational infrastructure. Secondly, use of XML enables applications to validate a VOTable document readily, using standard rules, which is something that FITS cannot do so readily, as many people who have written pipeline reduction systems driven by FITS headers can attest. Thirdly, the existence of the XSLT (Extensible Stylesheet Language Transformations, (15)) standard means that result sets in VOTable format can be easily transformed into other formats, as required.
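As an illustration of how readily a VOTable document can be transformed, the sketch below converts a small table to CSV. It uses a plain Python transformation with a standard XML parser rather than XSLT, since the Python standard library has no XSLT processor. The sample document follows the VOTABLE/RESOURCE/TABLE/FIELD/DATA/TABLEDATA layout as we understand it; the table contents are invented, and documents that declare an XML namespace would need namespace-aware lookups.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample data, laid out in the style of a VOTable document
SAMPLE_VOTABLE = """<VOTABLE version="1.0">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="RA" datatype="double" unit="deg" ucd="POS_EQ_RA_MAIN"/>
      <FIELD name="DEC" datatype="double" unit="deg" ucd="POS_EQ_DEC_MAIN"/>
      <FIELD name="Vmag" datatype="float" unit="mag" ucd="PHOT_JHN_V"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.68</TD><TD>41.27</TD><TD>3.4</TD></TR>
          <TR><TD>83.82</TD><TD>-5.39</TD><TD>4.0</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def votable_to_csv(votable_text):
    """Convert the first table of a (namespace-free) VOTable document to CSV text."""
    root = ET.fromstring(votable_text)  # raises ParseError if the XML is malformed
    table = root.find("./RESOURCE/TABLE")
    names = [f.get("name") for f in table.findall("FIELD")]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)  # header row taken from the FIELD metadata
    for row in table.findall("./DATA/TABLEDATA/TR"):
        writer.writerow([cell.text for cell in row.findall("TD")])
    return out.getvalue()


print(votable_to_csv(SAMPLE_VOTABLE))
```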

Another important interoperability feature is that VOTable requires not only the tagging of all physical quantities with their units, but also the use of Unified Content Descriptors (UCDs, (16)), which express the nature of the quantity. The set of ~1500 UCDs was abstracted from the column names of the ~3000 tables included in the VizieR(17) system at CDS, and they provide a means of recognising synonyms, as well as a preferred taxonomy for use in astronomical databasing. The derivation of the UCDs from VizieR means that the current set of UCDs bears the same biases and selectivities as the VizieR system itself. One of the tasks of the radio astronomers within AstroGrid and AVO has therefore been to develop the additional UCDs required for interferometry data. Similarly, as described in the Pilot Programme Report, the solar and STP pilots assessed the utility of VOTable in their respective areas, noting that additional UCDs would have to be defined before it could be used, but that it should be suitable for their requirements once they have been.
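The role of UCDs in recognising synonyms can be sketched as follows: two tables name their position columns differently, but a lookup by UCD finds the right columns in both. The column names are invented, and the UCD strings are given as illustrative examples of the VizieR-derived style; they should be checked against the UCD list (16) before use.

```python
# Column name -> UCD for two hypothetical tables that name their columns differently
TABLE_A_FIELDS = {"RAJ2000": "POS_EQ_RA_MAIN", "DEJ2000": "POS_EQ_DEC_MAIN", "Vmag": "PHOT_JHN_V"}
TABLE_B_FIELDS = {"alpha": "POS_EQ_RA_MAIN", "delta": "POS_EQ_DEC_MAIN"}


def column_for_ucd(fields, ucd):
    """Return the name of the first column tagged with the given UCD, or None."""
    for name, field_ucd in fields.items():
        if field_ucd == ucd:
            return name
    return None


def position_columns(fields):
    """Locate the (RA, Dec) columns of a table by UCD rather than by column name."""
    return (column_for_ucd(fields, "POS_EQ_RA_MAIN"),
            column_for_ucd(fields, "POS_EQ_DEC_MAIN"))


print(position_columns(TABLE_A_FIELDS))
print(position_columns(TABLE_B_FIELDS))
```

An application written this way needs no per-archive knowledge of column naming, which is what makes cross-matching of independently produced tables practical.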

One criticism frequently levelled at XML in relation to its use in astronomy is that its inherent verbosity means that data files in XML format are many times larger than they would be in, say, FITS binary format. VOTable provides a means of circumventing that, however, as it allows metadata and data to be stored in separate files, but linked according to the XLink model. This has many advantages - for example, processes can then use metadata to 'get ready' for their input data, or to organise third-party or parallel transfers of the data - but it is somewhat unsatisfactory to have to employ two methods to manipulate a VOTable document: an XML parser for the metadata and some less standard tool to handle the binary file. There are XML schemas for binary data being developed - for example, the Binx(18) proposal developed as part of the OGSA-DAI(19) project - and it is possible that one of these can be applied to extend VOTable in such a way that an application can manipulate a tabular dataset of any size, without needing to know whether it is all written in XML or has a binary file linked to it.

(6.3.5) Metadata associated with datasets

The FITS Standard is very widely used and allows an unlimited amount of (scalar) metadata, but it is recognised that conventions for its use are inadequate and AstroGrid members are active in the discussions within the VO community concerning the definition of the metadata systems required for the VO. One area where FITS is clearly inadequate is its treatment of provenance information. The FITS standard allows for any number of "HISTORY" tokens to be placed in the FITS header to record the provenance of the contents of the file's data units, but there is no convention for writing them and so the information they contain is not readily extracted by any means other than being read by a human. This is clearly inadequate in the era of the VO, in which it is desirable to have machine-readable metadata. For example, AstroGrid's MySpace concept allows for the storage of the results of VO operations, which could be complicated workflows as well as single database queries, and it would be very useful to be able to interrogate these result sets regarding their provenance, so that, for example, a complicated analysis does not have to be repeated if it has already been performed.
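The benefit of machine-readable provenance can be sketched as follows: if each result set is stored together with a structured record of the operation, inputs and parameters that produced it, a store can recognise when a requested analysis has already been performed. The record layout and the ResultStore class are hypothetical and do not represent the MySpace design.

```python
import hashlib
import json


def provenance_key(operation, inputs, parameters):
    """Derive a stable key from a structured provenance record."""
    record = {"operation": operation, "inputs": sorted(inputs), "parameters": parameters}
    canonical = json.dumps(record, sort_keys=True)  # same record -> same text
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResultStore:
    """Toy result store that reuses results whose provenance matches a request."""

    def __init__(self):
        self._results = {}
        self.computations = 0  # counts how often the analysis was actually run

    def get_or_compute(self, operation, inputs, parameters, compute):
        key = provenance_key(operation, inputs, parameters)
        if key not in self._results:
            self._results[key] = compute()
            self.computations += 1
        return self._results[key]


store = ResultStore()
for _ in range(2):
    # The second request has identical provenance, so the analysis is not repeated
    result = store.get_or_compute("cross-match", ["survey_a", "survey_b"],
                                  {"radius_arcsec": 2.0}, lambda: "matched-table")
print(result, store.computations)
```

Free-text "HISTORY" entries cannot support this kind of lookup, because two descriptions of the same operation need not be textually identical.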

(6.3.6) Access to resources

Interoperability standards are also necessary to provide access to resources in the VO. The ability to access computational resources at distant sites is the very essence of the Grid, so this is not an area where AstroGrid expects to develop the fundamental standards, but rather to adopt them. However, it is essential for the VO community to develop a consistent way of using these standard protocols (such as the Community Authorization Service described in the Grid Technology report) to implement an agreed policy for access controls in the VO. For example, most observatories impose proprietary periods on data, and, even if all agree to implement these using some Grid-standard digital certificate protocol, there must be agreement about how the existence of data access restrictions is made apparent within the VO and about exactly what credentials must be included in the digital certificate to effect access to a given resource.
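The proprietary-period example can be reduced to a simple rule, sketched below: data become public once the proprietary period has elapsed, and before then are available only to a user holding the appropriate credential. The function names and the single "is_owner" flag are assumptions made for illustration; how such a credential would be carried in a digital certificate is exactly what remains to be agreed.

```python
from datetime import date, timedelta


def release_date(observation_date, proprietary_days):
    """Date on which an observation leaves its proprietary period."""
    return observation_date + timedelta(days=proprietary_days)


def may_access(observation_date, proprietary_days, today, is_owner=False):
    """Owners always have access; others only once the data are public."""
    return is_owner or today >= release_date(observation_date, proprietary_days)


# Hypothetical observation with a 365-day proprietary period
observed = date(2002, 1, 1)
print(release_date(observed, 365))
print(may_access(observed, 365, today=date(2002, 12, 19)))
print(may_access(observed, 365, today=date(2002, 12, 19), is_owner=True))
```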

(6.4) Ontology

Cutting across the whole area of Interoperability is the concept of "ontology". In this context an ontology is an explicit formal specification of the terms in a domain and the relations between them, and it is relevant here because it can provide the conceptual framework to ensure that interoperability is implemented in a meaningful way. At some level, the VO will have an ontology, whether implicit (assuming astronomers' common knowledge) or explicit; for example, the set of UCDs forms some sort of an ontology, since they define the relations between the column names in the VizieR system.

A more all-encompassing ontology could be used within the VO, however, and AstroGrid staff are investigating this possibility, both in conjunction with colleagues from other VO projects and with ontology experts from other disciplines, notably the bioinformaticians in the myGrid(20) project (with which AstroGrid already has a collaborative relationship, by reason of these two projects being the chosen "early adopters" of OGSA-DAI deliverables).

References


(1) VOTable: http://cdsweb.u-strasbg.fr/doc/VOTable/
(2) World Wide Web Consortium (W3C): http://www.w3.org
(3) Global Grid Forum (GGF): http://www.gridforum.org
(4) Astrophysical Virtual Observatory (AVO): http://www.eso.org/avo
(5) US Virtual Observatory project: http://www.us-vo.org
(6) National eScience Centre (NeSC): http://www.nesc.ac.uk
(7) Open Grid Services Architecture (OGSA): http://www.globus.org/ogsa
(8) Simple Object Access Protocol (SOAP): http://www.w3.org/TR/SOAP
(9) Xindice: http://xml.apache.org/xindice/
(10) Universal Description, Discovery and Integration (UDDI): http://www.uddi.org
(11) AstroGLU: http://simbad.u-strasbg.fr/glu/cgi-bin/astroglu.pl
(12) Générateur de Liens Uniformes (GLU): http://simbad.u-strasbg.fr/glu/glu.htx
(13) Resource Description Framework (RDF): http://www.w3.org/RDF
(14) US-VO cone-search: http://us-vo.org/metadata/conesearch
(15) Extensible Stylesheet Language Transformations (XSLT): http://www.w3.org/TR/xslt
(16) Unified Content Descriptors (UCDs): http://cdsweb.u-strasbg.fr/doc/UCD.htx
(17) VizieR: http://vizier.u-strasbg.fr/viz-bin/VizieR
(18) Binx: http://www.epcc.ed.ac.uk/~gridserve/WP5/Binx
(19) Open Grid Services Architecture Database Access and Integration (OGSA-DAI): http://umbriel.dcs.gla.ac.uk/NeSC/general/projects/OGSA_DAI
(20) myGrid: http://www.mygrid.org.uk

-- BobMann - 30 Sep 2002
