Phase A Report

(1) Project Vision

(1.1) The AstroGrid Vision

The Virtual Observatory will be a system that allows users to interrogate multiple data centres in a seamless and transparent way, which provides powerful new analysis and visualisation tools within that system, and which gives data centres a standard framework for publishing and delivering services using their data. This is made possible by standardisation of data and metadata, by standardisation of data exchange methods, and by the use of a Registry which lists available services and what can be done with them. The Registry should embody some kind of "ontology" which encodes the meaning of quantities in the databases served and the relationships between them, so that user software can for example collect fluxes at various wavelengths from various databases and then plot a spectral energy distribution.

The long term vision is not one of a fixed specific software package, but rather one of a framework which enables data centres to provide competing and co-operating data services, and which enables software providers to offer a variety of compatible analysis and visualisation tools and user interfaces. The first priority of AstroGrid, along with the other VO projects worldwide, is to develop the standardised framework which will allow such creative diversity.

However, the intentions of AstroGrid go beyond this framework. We will develop a working implementation of immediate use to astronomers. As a consortium of data centres and software providers, we will pool resources, including key UK databases, storage, and compute facilities. On top of this, the AstroGrid project per se will provide the first data services, along with a standard "point of entry" user interface, and a set of datamining tools. AstroGrid will also provide central resource on top of that provided by the participating data centres - first and foremost the construction and maintenance of an Astronomical Registry, but also one or more data warehouses, further CPU dedicated to search and analysis tools, and storage and software to create "MySpace", a kind of virtual workspace for grid-users.

Implementing such a functioning VO capability will support UK astronomy in several ways. It should make doing astronomy faster, more effective, and more economic, by standardising the data analysis process and by freeing the astronomer from many mundane tasks. It also has the potential to influence the discovery process in astronomy in a dramatic way - by encouraging new styles of data-intensive exploratory science, by removing interdisciplinary barriers, and by encouraging the pooling of resource and the formation of distributed collaborative teams. We also expect that it will be a liberating force in that the resource available to astronomers will become almost independent of their location.

(1.2) Development of the Project

AstroGrid has its origins in the Long Term Science Reviews (LTSR) undertaken by PPARC in 1999/2000, which placed IT initiatives in astronomy in general, and large database initiatives in particular, as high priorities in all the panel areas. (Similar ideas were developing across Europe and the US, and for example construction of a US "National Virtual Observatory" was recommended by the NSF decadal review). Meanwhile e-science and the Grid played a large part in Government thinking during the 2000 spending review, and an AstroGrid project concept developed by astronomers from Leicester, Cambridge, Edinburgh and RAL was used by PPARC in its bid. A "white paper" on AstroGrid was reviewed by PPARC Astronomy Committee in October 2000, and debated around the community. The result was an expansion of the consortium to seven institutions, and an increased remit to cover solar and solar-terrestrial physics as well as optical, IR, X-ray and radio astronomy. A formal proposal was submitted to the PPARC e-science AO in April 2001, and a funded project finally began in September 2001. Initial funding was for a one-year Phase A study, with final project funding to be determined by a review at the end of Phase A.

During Phase A we have concentrated on the following main activities. (i) Requirements analysis, including community consultation, development of key science problems, and articulation as formal use cases. (ii) Development of a UML-based architecture. (iii) Technology assessment reports. (iv) A series of small software demos to test ideas and show them to others. (v) Development of interactive collaborative web pages - a static portal, a News site, a Forum site, and a Wiki for collaborative construction of documents and software. This document is a report on those Phase A activities, accompanied by a Phase B plan, with this section (Project Vision) being an overall summary of where we are and where we are headed. The Phase A study is being reviewed by PPARC's Grid Steering Committee in Oct 2002, following which we expect to begin our construction phase at the beginning of 2003.

(1.3) General Science Drivers

The scientific aims of AstroGrid are very general and can be summed up as follows:

  • to improve the quality, efficiency, ease, speed, and cost-effectiveness of on-line astronomical research
  • to make comparison and integration of data from diverse sources seamless and transparent
  • to remove data analysis barriers to interdisciplinary research
  • to make science involving manipulation of large datasets as easy and as powerful as possible.

The first driver, then, is to improve the quality and speed of on-line research. Astronomers already do much of their research on-line through data centres. The idea is to step up the quality of service offered by those data centres, beyond simple access to archives by downloading subsets. This will mean the ability to make complex queries of catalogues of objects or catalogues of observations, and the ability to analyse the data in situ - for example to transform or pan across an image, or to draw a colour-colour-colour plot for selected objects and rotate it. Such improved service can be seen as part of a long trend in astronomy to develop communally agreed standard tools so that the astronomer can concentrate on doing the science rather than wiring their own instruments, or hacking their own data reduction software. Following facility-class instrumentation, then facility-class data reduction tools (Starlink, IRAF, Midas etc), then easy access to data and information (on-line archives, ADS, Vizier, etc), the next step is facility-class analysis tools. However we are also driven to this solution by the expected data explosion in astronomy. For very large datasets, such as the optical-IR sky survey which VISTA will accumulate at hundreds of TB per year, users cannot afford to store their own copy, nor do they have time to download it. Data centres are therefore driven to provide analysis services as well as data access.

Along with improved query and analysis tools, the next driver is the ability to make multi-archive science easy. The study of quasars requires data at all wavelengths; finding rare objects such as brown dwarfs involves simultaneous searching in optical and IR data; study of the solar cycle involves putting together data from many different satellites over eleven years or more; and so on. There is increasing interest in combining data from different disciplines, such as linking solar observations of coronal mass ejections to changes found in monitoring of the Earth's magnetosphere. The idea is to transform this kind of science from slow and painful hand-driven work to push-button easy, so that through a single interface one can make joint queries such as "give me all the objects redder than so-and-so in UKIDSS that have an XMM ID but don't have an SDSS spectrum", or ask higher-level questions, such as "construct the spectral energy distribution of the object at this position". Sometimes the tasks will involve predetermined lists of data services, but often they will involve the AstroGrid system making a trawl and deciding what is relevant, using some kind of registry of services.
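
As an illustration of the kind of set logic a portal would perform on the user's behalf, the short Python sketch below combines three result sets that stand in for answers from the separate archive services; the object identifiers and colour cut are purely illustrative.

```python
# A minimal sketch (with made-up object identifiers) of the client-side set
# logic behind such a joint query. Each set stands in for the result returned
# by a separate archive service.
ukidss_red   = {"obj1", "obj2", "obj3", "obj5"}   # redder than the colour cut
xmm_detected = {"obj2", "obj3", "obj4"}           # has an XMM counterpart
sdss_spectra = {"obj3"}                           # already has an SDSS spectrum

targets = (ukidss_red & xmm_detected) - sdss_spectra
print(sorted(targets))   # -> ['obj2']
```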

As well as offering improved data services, and multi-archive services, we wish to facilitate data intensive science. Some of the most interesting science comes from manipulation of huge numbers of objects. This can mean looking for rare objects, for example those with strange colours or proper motions, or constructing a correlation function, or fitting Gaussian mixtures to N-D parameter sets, and so on. At the moment such projects are the province of specialist "power users", but the vision is to make such analysis easy, as a service through data centres. This will require data centres to provide not just storage but also high-powered search and analysis engines. In addition, we need to develop standard tools for such kinds of analysis, and a way for users to upload their own algorithms to run on the data. We see all this as empowerment of the individual in astronomy. One doesn't need to be at Caltech or Cambridge to have the very best resources at one's fingertips. AstroGrid and other VO projects will not provide everything needed for this new kind of science - many others will invent the algorithms and write the software tools - but we need to put the framework in place to make this possible, and to provide at least some tools in our early working system.

(1.4) Specific Science Requirements

The previous section summarises the general scientific aims of AstroGrid. There is no specific scientific topic driving the project - the infrastructure should serve a whole range of present and future scientific concerns. However, in order to actually build the system we need concrete requirements, and in order to construct these we need to look at some specific scientific questions in detail. We therefore collected a series of Science Problems and analysed them in a fairly formal blow-by-blow manner. These were contributed both by Project members and through community consultation. There were too many of these to use as formal requirements for the system architecture. We therefore selected the AstroGrid Top Ten, chosen to represent a range of science topics, and to encapsulate the key recurrent technical issues. From these we then developed more formal use cases and sequence diagrams to feed into the architectural design. The Top Ten Science Problems used were:

  • Brown Dwarf Selection
  • Discovering Low Surface Brightness Galaxies
  • The Galaxy Environment of Supernovae at Cosmological Distances
  • Object Identification in Deep Field Surveys
  • Localising Galaxy Clusters
  • Discovering High Redshift Quasars
  • The Solar-Stellar Flare Comparison
  • Deciphering Solar Coronal Waves
  • Linking Solar and STP events
  • Geomagnetic Storms and their impact on the Magnetosphere

(1.5) The Virtual Observatory Concept

The science drivers described above are closely related to the popular concept of a "Virtual Observatory", especially the ideas of multi-archive science, and transparent use of archives. The idea can be summed up in one sentence. The aim of the Virtual Observatory is to make all archives speak the same language.

  • all archives should be searchable and analysable by the same tools
  • all data sources should be accessible through a uniform interface
  • all data should be held in distributed databases, but appear as one
  • the archives will form the Digital Sky

To this now standard VO vision, AstroGrid adds the desire that more advanced analysis and visualisation tools should be available for studying the digital sky, and that high-powered computational resources should be available for undertaking data intensive studies.

(1.6) The Grid Concept

The "Grid" concept originally referred to computational grids, i.e. distributed sets of diverse computers co-operating on a calculation. However, the idea has expanded to refer to a general sense of transparent access to distributed resources, and a sense of collaboration and sharing. The resources which are shared could be storage, documents, software, CPU cycles, data, expertise, etc. The term "Grid" is an analogy with the electrical power grid. Spread over the nation there is a network of huge power stations, but the user doesn't need to know how to connect to them. One simply plugs one's hair-dryer into the socket, and electricity flows. The history of computing can seen as an evolution towards the Grid concept. First came the physical networks, and the protocol stacks, to enable us to pass messages between computers. Next came the World Wide Web, providing transparent sharing of documents. Then came computational grids enabling shared CPU. A popular concept now is that of a datagrid, making possible transparent access to databases. This is close to the Virtual Observatory concept, but to truly reach this ideal, we believe that what we need is a service grid. This involves not just open access to data sources, but also standardised formats and standardised services, i.e. operations on the data. Beyond this, the Grid community talk of information grids, knowledge grids, and Virtual Organisations_.

The general Grid idea of transparent access to resources is then central to AstroGrid and the VO concept. At first sight our vision of a service network, where data access and computations are provided by one data centre at a time, and results are combined by the client, doesn't seem to embody the Grid vision of pooled, managed resources and communal collaboration. However, what we expect is that the collaboration and pooling will be by consortia of data centres, on behalf of the community, to give the best possible service to users. Therefore although we don't often expect to make diverse computers collaborate on calculations, we do expect, within our consortium, to route queries to multiple nodes, in awareness of the various hardware resources and their state at the time, and to establish dynamically updated mirrors and warehouses of our combined key databases. This will need a collaborative approach to resource and fabric management, job scheduling and job control, and so many of the key Grid concepts and software technologies will be of direct relevance. We also need dedicated high speed networking between collaborating data centres.

(1.7) The technical route forward

Standards, Standards, Standards. Our prime targets for progress are as much sociological as technological. We have to evolve agreed standard formats for data, metadata, provenance, and ontology. Astronomy has actually been in the vanguard of data standardisation, with the FITS format, bibcodes, and so on, but we now must go further, and need to produce XML-based standards to fit into the commercial computing world. Obviously this cannot be done by AstroGrid in isolation, but by international discussion. A key step forward has been the recent development of VOTable, an XML-based format for table data. Provenance refers to recording the history of where data has come from, who has touched it, which programs have transformed it and so on. This is already normal in good astronomical pipelines, but not standard in archives. As results are extracted from data analysis servers, and passed on to other services and so on, recording this history will become crucial, and we need to agree standard formats for recording such data. Ontology refers to recording the meaning of columns in a database, and the relationships between them. A familiar problem is receiving a table with a column labelled "R-mag" and not knowing whether it refers to a Johnson, Gunn, or Sloan R, let alone whether the normalisation is as a Vega magnitude or an AB magnitude. Ideally we want not just to agree terminology for specific quantities, but to specify their relationships in order to allow software inference using the data. CDS Strasbourg have made an excellent start in this area with their huge tree-structured list of Unified Content Descriptors (UCDs), but we need to improve these ideas and translate them into new XML-based ontology markup languages (DAML and OIL, and eventually the emerging W3C standard OWL).
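
To make the table standard concrete, the sketch below writes a minimal VOTable-style document using only the Python standard library. The column names, units, and UCD strings are illustrative rather than definitive; the point is that each FIELD carries machine-readable metadata alongside the data themselves.

```python
# A minimal, illustrative VOTable written with Python's standard library.
# The UCD strings below are indicative only; real services should use the
# agreed CDS vocabulary.
import xml.etree.ElementTree as ET

votable = ET.Element("VOTABLE", version="1.0")
table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE", name="results")

# Each FIELD describes one column: its type, unit, and a UCD giving its meaning.
ET.SubElement(table, "FIELD", name="RA", datatype="double", unit="deg",
              ucd="POS_EQ_RA_MAIN")
ET.SubElement(table, "FIELD", name="DEC", datatype="double", unit="deg",
              ucd="POS_EQ_DEC_MAIN")
ET.SubElement(table, "FIELD", name="Rmag", datatype="float", unit="mag",
              ucd="PHOT_MAG_R")        # which R band? the UCD should say

tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
for row in [(180.1234, -0.5678, 19.2)]:
    tr = ET.SubElement(tabledata, "TR")
    for value in row:
        ET.SubElement(tr, "TD").text = str(value)

ET.ElementTree(votable).write("example.vot", encoding="utf-8", xml_declaration=True)
```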

Internet Technology. To construct a VO, we need to take advantage of several developments in internet and grid technology. The first requirement is protocols for exchanging and publishing data. The idea of web services has almost solved this problem, with XML data formats, SOAP message wrappers, and the Web Services Description Language (WSDL). The problems are that standard web services are one-to-one, stateless, and verbose, so we need to add methods for linking to bulk binary data, for composing multiple services with lifetime management, and for defining and controlling workflow. However, before some portal software can connect a user to web services, it needs to know of their existence, which requires their publication in a Registry. There is a developing commercial registry standard, UDDI, but its structures map poorly onto astronomy, so we will write a specialised AstroGrid Registry. As well as simply advertising service availability, the Registry will collate coverage information and other metadata (including ontology) from available datasets, so that many queries, and the first stage of all queries, can be answered directly from the Registry before going to the remote service. The next requirement is a method of transmitting identity, authorisation, and authentication to achieve the goal of single-sign-on use. One doesn't want a trawl round the world's databases to stop thirteen times and come back and ask you for another password. There are various commercial solutions to this problem, but for a number of reasons they are not appropriate for astronomy. We have chosen to follow the Community Authorisation Server (CAS) model from Globus, using X.509 certificates and standardised distinguished names. Finally, there is the issue of managing distributed resources, implicit in our intention to act as a consortium of data centres - job control, resource scheduling, query routing and so on. These are the key issues in the developing world of Grid Technology, from which we will select and deploy as necessary.
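
The discover-then-invoke pattern described above might look something like the following sketch. The registry URL, service URL, and parameter names are hypothetical placeholders; a production portal would describe services in WSDL, call them through SOAP, and attach the user's certificate-based credentials to each request.

```python
# Hypothetical discover-then-invoke sketch: ask a registry which services can
# answer a query, then call one of them directly. All endpoints and parameter
# names here are placeholders, not real AstroGrid interfaces.
from urllib.parse import urlencode
from urllib.request import urlopen

REGISTRY_URL = "http://registry.example.org/search"      # hypothetical

def find_services(waveband, capability="cone-search"):
    """Return the registry's description (e.g. a VOTable) of matching services."""
    query = urlencode({"waveband": waveband, "capability": capability})
    with urlopen(f"{REGISTRY_URL}?{query}") as response:
        return response.read().decode()

def cone_search(service_url, ra_deg, dec_deg, radius_deg):
    """Query one data service for sources around a position."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    with urlopen(f"{service_url}?{query}") as response:
        return response.read().decode()
```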

To enable the kind of data intensive science we envisage, we need to take advantage of improved datamining algorithms and visualisation techniques. This is an important area for AstroGrid, as it provides added value to the basic multi-archive data services framework. To a first approximation, it is the job of scientists world-wide to invent new algorithms, of a variety of software providers to realise these as software tools, and of participating data centres to implement them as services. However, in our effort to kick-start the VO world, AstroGrid will work on the development of example techniques. Also of course, the portal software needs to understand the kind of services available in order to provide an interface to them.

There are two simple technical issues which dictate the structure of the framework that we set up. The first is the I/O bottleneck. Some problems are limited by CPU-disk bandwidth, which has grown much more slowly than Moore's law, and some are limited by seek time, which has hardly changed at all. This means that searches and analyses of large databases take an extremely long time unless high throughput parallel facilities (clusters and/or multi-processor machines) are used, along with innovative and efficient algorithms. The second issue is the network bottleneck. Networks are improving but are in practice limited by end-point CPUs and firewalls rather than fibre rental, and are not expected to be nearly good enough to routinely move around the new large databases. Given that users can't realistically download large databases, or have room to store them, or have the search and analysis engines required, we are driven to a situation where the data stay put, but the science has to be done next to the data. In other words, data centres have to provide search and analysis services - the motto is "shift the results, not the data".
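
A back-of-the-envelope calculation, with deliberately round illustrative numbers, shows why the data must stay put. For a 100 TB survey archive, a single sequential disk stream and a sustained wide-area link are both hopelessly slow:

```python
# Illustrative numbers only: a 100 TB archive, one 50 MB/s disk stream,
# and a sustained 100 Mbit/s wide-area network connection.
archive_bytes = 100e12
disk_rate = 50e6          # bytes/s for a single sequential disk stream
net_rate = 100e6 / 8      # bytes/s on a 100 Mbit/s link

DAY = 86400.0
print(f"single-disk scan : {archive_bytes / disk_rate / DAY:.0f} days")   # ~23
print(f"network transfer : {archive_bytes / net_rate / DAY:.0f} days")    # ~93
# Hence the need for parallel search engines located next to the data.
```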

The above conclusion, together with the fact that the human expertise on any new and exciting dataset will usually live next to the data, dictates the geometry of AstroGrid. We do not want a super centralised archive. Neither will we have a truly democratic peer-to-peer network like Napster, or a hierarchical system like the LHC Grid. Rather, what we have is a moderate number of competing specialist data service centres and a large number of data service users. AstroGrid itself offers a specialist Registry service as well as a portal. In the future other organisations could offer competing registries. The purist model of independent data centres is in practice likely to be complicated by collaborations between those services. (The AstroGrid consortium is precisely such a collaboration.) For example, very often astronomers will want to cross-match sources in different catalogues on the fly, which seems to require either shifting data across the net, or a single-location data warehouse. In fact we expect that collaborating data centres, as opposed to users, will be connected by dedicated fat pipes, and an intelligent approach to cross-matching can minimise traffic. We also expect that as part of natural competition, any one data centre could choose to offer a warehouse with many catalogues, although typically this would not be the latest version of a currently growing archive such as SOHO, HST, or UKIDSS.
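
To make the cross-matching step concrete, here is a deliberately naive positional match between two small catalogues; it illustrates the operation itself, while a real on-the-fly service would use indexed, sky-partitioned algorithms to keep traffic and compute cost down.

```python
import math

def cross_match(cat_a, cat_b, radius_arcsec=2.0):
    """Brute-force positional cross-match of two small catalogues.

    Each catalogue is a list of (source_id, ra_deg, dec_deg) tuples.
    Ignores RA wrap-around at 0/360 degrees for brevity.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            d_ra = (ra_a - ra_b) * math.cos(math.radians(dec_a))
            d_dec = dec_a - dec_b
            if math.hypot(d_ra, d_dec) <= radius_deg:
                matches.append((id_a, id_b))
    return matches

# Example: match a toy optical catalogue against a toy X-ray catalogue.
optical = [("opt1", 150.00010, 2.20000), ("opt2", 150.10000, 2.30000)]
xray    = [("x1",   150.00020, 2.20010), ("x2",   151.00000, 2.50000)]
print(cross_match(optical, xray))   # -> [('opt1', 'x1')]
```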

(1.8) World context

AstroGrid is not an isolated project. Firstly it is connected to a variety of UK e-science projects from whom it can take both lessons and actual software. The two most important examples are GridPP and MyGrid. GridPP, the UK contribution to the LHC Grid project, is by far the most advanced in terms of actually constructing a working Grid, and will be the prime source of experience and software for resource management and job control. MyGrid is a UK e-science biology project. The requirements of bio-informatics are very similar to those of astronomical e-science, with a variety of heterogeneous databases, an increasing need to search and analyse multiple databases, and a desire to manipulate large amounts of data, along with an even stronger emphasis on metadata and provenance. In addition, computer scientists involved in the MyGrid project are world experts in ontology, an area we are sure will grow in importance for astronomy.

Next, AstroGrid has good working connections with the UK e-science core programme centred around the National E-Science Centre (NeSC) in Edinburgh and Glasgow, the Grid Support Team at RAL, and regional centres in Belfast, Manchester, Cambridge and UCL. The most important connection is that with the OGSA-DAI project. OGSA (Open Grid Services Architecture) is a joint Globus-IBM project aimed at merging the ideas of the web services world and the Grid world. The UK programme has identified structured database access over the grid as a key problem which the UK will lead, through a GGF working group, and by forming the OGSA-DAI (Database Access and Integration) project. AstroGrid and MyGrid have been declared "early adopters" of OGSA-DAI products, and we are already working with the team.

Finally AstroGrid has close relations with other VO projects worldwide. The two most important are the US-VO project, and the EU funded Astrophysical Virtual Observatory (AVO) project. The AstroGrid consortium is formally a partner within AVO, which funds two posts in Edinburgh and one at Jodrell Bank. In return, a similar number of PPARC funded FTEs within AstroGrid are available as effort to AVO. AstroGrid has special responsibility for technology development within AVO. Our work packages are carefully aligned to maximise the joint usefulness of work done.

Specific implementations (such as AstroGrid) of working datagrids, user tools, portals, and data services do not have to be globally identical. The framework being developed should encourage creative diversity. There can even be rival registries. However we do have to evolve towards a situation where the underlying infrastructure of standards, protocols, and key software elements is universal. To this end, since late 2001, the three major funded projects (US-VO, AVO, and AstroGrid) have held both joint workshops and monthly Lead Investigator telecons. In June 2002 at the Garching conference "Towards an International Virtual Observatory" we officially formed the International Virtual Observatory Alliance (IVOA), agreed a Roadmap, and added members from further nascent projects in Germany, Australia, Canada, and Russia. The IVOA is certain to grow in importance.

(1.9) Project Methodology

The project began with a fairly standard workpackage structure. However we soon decided to run the project along the lines of the Unified Process. This means being use-case centric, architecture driven, and iterative. The formal architecture is being developed in the Unified Modelling Language (UML). We began constructing formal blow-by-blow use cases, but soon found that before this was possible we needed one layer of abstraction above: formulating Science Problems from which use cases could then be articulated. We collected a large number of these and picked the AstroGrid Top Ten science problems, selected not as the most important but as representing a good spread of the kinds of problems we need to solve. The formal approach to architecture is important, but on the other hand, the aim of iteration is that the project design does not freeze too early, but converges along with implementation in quarterly cycles, while code developers remain agile.

Working as a distributed project is not trivial. To this end, we developed several web-based collaboration tools. All of them are interactive, with registered members able to make postings as well as read entries. The first is a News site. The second is a Forum aimed at discussion of technical topics. But the most interesting is the AstroGrid Wiki, where we jointly develop documents, record meeting minutes, collate links to other work, deposit code, and so on. Any member can directly edit any of the web pages. This has been an enormous boon to productivity, and to keeping track of developments.

(1.10) Building the VO

To create the Brave New World, several strands of work are needed - by the VO projects, by data centres, and by astronomical software providers.

(a) The VO projects will work together to develop agreed standards, for data formats, tables, metadata, provenance, and ontology. This is already happening as a natural consequence of the work programmes of the various VO projects, which are creating de facto standards. (Eventually they should achieve a more formal endorsement through the IAU.) A significant amount of AstroGrid effort will be expended in this direction.

(b) Each VO project has expended substantial effort in research and development, and in assessing new technologies. For the AVO, and for AstroGrid Phase A, this has been the main purpose of the work to date, and for AVO will continue to be so. For AstroGrid Phase B we will be concentrating on implementation, but we still expect continuing R&D at approximately 20% of staff effort. Partly this is because of the iterative converging nature of our software development process (see below), but it is also necessary because both the commercial and the academic technologies that we are building on are changing rapidly. Also of course, we need to be in a strong position for whatever e-science work follows on from completion of AstroGrid.

(c) Next, the VO projects will be developing software infrastructure - components such as a job scheduler, data router, query optimiser, authorisation server, registry, and so on, along with choices of technology such as SOAP and WSDL, OGSA, OIL, etc. Building this infrastructure will take the largest part of the AstroGrid Phase B staff effort. The software components used by VO projects worldwide do not have to be globally identical, but in practice as we exchange experience, they are likely to converge. However, the major VO projects, while starting at much the same time, have different timescales. US-VO is a five year project. AVO is a three year R&D project, with the intention of an ensuing Phase B build phase. AstroGrid is intending to complete a software infrastructure in three years. We expect that a large fraction but not all of the AstroGrid code will still be used in later VO work.

(d) Once the necessary standards and software infrastructure are in place, and on the assumption that at least one registry is constructed and maintained, then Data Centres around the world can publish data services, i.e. can make available queries on, and operations with, their data holdings. This implies some work by those Data Centres to play the game, but it will come to be seen as being as normal and necessary as writing an organisation's web pages is today. The Data Centres will establish and maintain their data in whatever format they like, and will build engines to deliver queries, analysis, visualisation, or whatever, but will use the VO-provided infrastructure to build a standard interface to their services.

(e) In order to actually do science with the data returned we will need some front end tools - for example some kind of portal; tools to view images, spectra, time series, etc; tools to plot spectral energy distributions from returned multi-wavelength data and to fit models; tools to make N-D plots and rotate them; and so on. Likewise the services offered by data centres need to offer not just data extracts, but data manipulation tools such as Fourier transforms, cluster analysis, outlier detection, and so on. Most such tools will not be developed by the VO projects, or even by the data centres, but by a variety of software providers all over the world, just as now, but with the addition that such tools will need to be VO compatible.

(f) The first three strands above are work to be done by the VO projects, whereas the latter two strands (publishing data services and developing data mining algorithms) represent work that will be done by many different organisations and individuals worldwide. However, AstroGrid, as well as being a VO development project, is a consortium of data centres and software providers, and expects to develop an early working system. This is partly to act as a proof of concept and exemplar to other future users of the VO infrastructure, but also to build a tool of real daily use to scientists. This will mean constructing and maintaining a Registry, writing data services for key databases of UK interest (e.g. UKIDSS, SOHO, XMM and so on), a user portal, and some user tools and datamining tools. Full development of such tools is too large a job for AstroGrid, so we are likely primarily to adapt existing tools, such as Gaia or Querator.

(g) A working implementation such as that described above has to run on real physical resources. The final task of the AstroGrid project is therefore to establish and manage a physical service grid. Much of the resource (data, storage, search engines, analysis engines) will be provided by the data centres that are members of the AstroGrid consortium, but further resource - some storage and CPU - will be supplied by the AstroGrid project per se. This will be to establish one or more data warehouses, to maintain and operate the AstroGrid Registry, and to provide "MySpace", a virtual storage and workspace system for AstroGrid users. We will also be investigating how to maximise the bandwidth between the participating data centres.

(1.11) Goals of the AstroGrid Project

In summary, these are our SCIENTIFIC AIMS:

  • to improve the quality, efficiency, ease, speed, and cost-effectiveness of on-line astronomical research
  • to make comparison and integration of data from diverse sources seamless and transparent
  • to remove data analysis barriers to interdisciplinary research
  • to make science involving manipulation of large datasets as easy and as powerful as possible.

And these are our top-level PRACTICAL GOALS:

  • to develop, with our IVOA partners, internationally agreed standards for data, metadata, data exchange and provenance
  • to develop a software infrastructure for data services
  • to establish a physical grid of resources shared by AstroGrid and key data centres
  • to construct and maintain an AstroGrid Service and Resource Registry
  • to implement a working Virtual Observatory system based around key UK databases and of real scientific use to astronomers
  • to provide a user interface to that VO system
  • to provide, either by construction or by adaptation, a set of science user tools to work with that VO system
  • to establish a leading position for the UK in VO work

-- AndyLawrence - 28 Sep 2002
