[DRAFT: 2002-03-16]
Comments from the AstroGrid consortium on
"Data Requirements for the Grid: Scoping Study
Report" (Draft 1Ab: 2002-02-08)
Summary:
This report provides a very useful summary of the
exercise, performed at the request of the Database and
Architecture Taskforces of the UK e-science core programme, to
identify the requirements for creating, maintaining and accessing
data in a Grid environment, and contains a wealth of useful
information, which will be very valuable in the development of
the services required to integrate databases into the Grid. Its
author, Dave Pearson, is to be commended warmly for producing
such a detailed and wide-ranging report, and for seeking input to
it from such a broad cross-section of the UK e-science community
during the requirements analysis exercise that he conducted: this
document certainly does meet its stated aim of providing an
introductory overview of the importance and role of data in the
Grid.
The report explicitly states that it does not
describe any requirements in sufficient detail to be the sole
input for designing solutions, and that it is intended only as an
input to a programme to refine and prioritise requirements so
that the design and prototyping of software components can begin,
but we feel that two significant changes are required before it
can fulfil that role properly. Firstly, by describing "the
requirements for creating, maintaining and accessing data in a
Grid environment", the report implicitly conflates
requirements placed on two distinct communities, namely the
curators of the data to be accessed through the Grid and the
developers of the Grid middleware required to perform that
access. The Database and Architecture Taskforces may be able to
direct the work of the latter group, to some extent, but their
influence over the former is very limited, probably amounting to no more than
advocating protocols and standards that the data curators should
adopt if they want their data to be accessible via the Grid. In
the light of this, it would be useful if the report stated
explicitly to which community each requirement is addressed, so
that those requirements on Grid middleware developers could be
clearly identified and prioritised, and the requisite labour
marshalled by the Database and Architecture Taskforces. Secondly,
the structure of the document could be improved significantly if
the text were numbered at the subsection or paragraph level, and
if requirements identified within it were themselves numbered and
made obvious by bold text or some other means. In that way, it
would be much easier to check that Appendix III really is a
complete summary of the requirements, without which assurance it
is impossible to start prioritising them.
Further comments are given below - general
comments first, then specific comments, in the order in which
they arise in the text. Their number should not be taken as
detracting from the fact that we welcome this report warmly, and
will gladly contribute to its further development, and its
translation into a plan of design and prototyping work, in so far
as that impinges on the work of AstroGrid.
General comments:
It would be very useful to have a
discussion of what it means to say that a database is
"on the Grid" and what aspects of a database
service differentiate it from other Grid services. That
would help focus on what additional functionality is
required to integrate databases into the Grid, over and
above that to be provided by generic Grid toolkits.
A clear distinction should be made
between "the Grid" as the sum of the resources
accessible through Grid middleware and "the
Grid" taken to denote that middleware. As discussed
in the Summary, the conflation of these two entities, and
of the communities behind them, means that much of what
is discussed in this report is clearly the concern of
data service curators, rather than the developers of Grid
middleware: that does not mean that these are not
requirements worthy of identification, but for them to be
useful it is necessary that it be clear to whom they
apply. For example, there is a lot of attention to
questions of access control, which will obviously be a
major issue when data is accessed through the Grid, but
the setting up of policies and protocols to restrict
access is surely the task of the curator of the data
service, and the only requirement on Grid middleware is
that it supports the transfer of the
authentication information required to implement those
policies through the corresponding protocols.
There was some concern that implicit in
this document (and others) is an assumption that XML is
the transport medium for data in the Grid, while some
applications involving astronomical databases will
necessarily involve such large data volumes that a more
efficient (e.g. binary) format would be highly
preferable.
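
As a rough illustration of this concern, the following Python sketch
compares the size of a small table of floating-point values when
serialised as XML text and when packed as raw binary. The element
names, the three-column row layout and the choice of 64-bit
little-endian doubles are assumptions made purely for the example, not
formats proposed in the report, and the actual ratio will depend on the
markup and the precision used.

```python
import random
import struct
import xml.etree.ElementTree as ET


def make_rows(n, seed=0):
    """Generate n hypothetical (ra, dec, mag) rows with a fixed seed."""
    rng = random.Random(seed)
    return [
        (rng.uniform(0.0, 360.0), rng.uniform(-90.0, 90.0), rng.uniform(10.0, 25.0))
        for _ in range(n)
    ]


def encode_xml(rows):
    """Serialise the rows as XML, one element per value (assumed layout)."""
    table = ET.Element("table")
    for ra, dec, mag in rows:
        row = ET.SubElement(table, "row")
        # repr() gives the shortest text that round-trips a Python float.
        ET.SubElement(row, "ra").text = repr(ra)
        ET.SubElement(row, "dec").text = repr(dec)
        ET.SubElement(row, "mag").text = repr(mag)
    return ET.tostring(table)


def encode_binary(rows):
    """Pack each row as three little-endian 64-bit doubles (24 bytes)."""
    return b"".join(struct.pack("<3d", *row) for row in rows)


def decode_binary(payload):
    """Recover the rows from the packed binary payload."""
    return list(struct.iter_unpack("<3d", payload))


rows = make_rows(1000)
xml_bytes = encode_xml(rows)
bin_bytes = encode_binary(rows)
print("XML bytes:   ", len(xml_bytes))
print("binary bytes:", len(bin_bytes))
```

Both encodings carry the same values at full double precision; the
difference in size comes from the markup and from representing numbers
as decimal text.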
Maybe this is simply a reflection of the
inputs received, but the report seems to give
insufficient emphasis to discussing what the Grid will
enable people to do with data resources that they are
unable to do now. For example, it would be good to see
more of a discussion of the mining of "virtual
databases", and the requirements that will impose on
hardware: here, in particular, the report (or a
subsequent development of it) would benefit from a more
quantitative approach, scoping the data volumes
and transfer rates likely to be required by data resource
users on the Grid.
As noted in the Summary, the lack of
systems for numbering the sections of the text and the
requirements made it difficult to match the two, so,
although we note below what we believe to be requirements
missing from Appendix III, we cannot claim any great
completeness for this list.
Some of the entries in the Requirement column
of the table in Appendix III do not seem to be
requirements, as such: e.g. "Contextual
metadata" - there can be a requirement that there is
contextual metadata or that contextual metadata provides
certain functionality, but "contextual
metadata" by itself is not a requirement,
grammatically speaking.
Specific comments:
- p. 11, final paragraph of Data Source section.
There seem to be requirements identified here which are
not contained in Appendix III, to do with: the capture of
output from a data source connected to the Grid; the
import of output from a data source that is not
connected; and the ability to integrate output with
existing data in a Grid environment - this last seems
particularly important.
- p. 11, first paragraph of Data Resource section.
It is not clear that a data resource in a Grid environment
has to be a persistent data store. For example,
a common sort of operation within the Virtual Observatory
(VO) context is likely to involve a join between two (or
more) geographically-distributed databases. Completion of
the operation may require the further analysis of the
combined table at some other location in the Grid where
some particular hardware or software resides, but the
combined table may not be of interest after that analysis
is complete. So, data resources in the Grid may well be
transient; indeed, the freeing of resources is likely to
require them to be explicitly created with finite
lifetimes.
- p. 12, last paragraph of Data Resource
section. This paragraph also seems to contain missing
requirements. The requirement to be able to manage
data resources online is particularly important in
astronomy, where that management is happening remotely:
for example, in radio astronomy, a key requirement for
the VO will be to allow users to process visibility data
stored at a remote data centre.
- p. 12, third paragraph of Database section and
first paragraph of the Data Formats and Precision
section on p. 13. The requirement to create a virtual
database from a set of data resources with a
heterogeneous range of structures is the key component of
the VO, as noted in this paragraph, but this requirement
appears to be missing from Appendix III.
- p. 13, first sentence. There was some disquiet with the
implication that XML is the only way of defining
semi-structured data - for example, the FITS header
beloved of astronomers can be regarded as semi-structured
data.
- p. 13, third paragraph of the Data Formats and
Precision section. It might be worth spelling out
that FITS stands for the Flexible Image Transport System
and giving a URL for FITS (e.g. http://fits.gsfc.nasa.gov).
- p. 14, Data Types section. It may not be very
important, but it was unclear to some what the
significant distinction is between results data
and derived data.
- p. 18, fourth paragraph of Data Resource
Characteristics section. This paragraph identifies
some important requirements which are missing from
Appendix III. Surely one of the crucial differences
between the sort of database queries executed now and
those that will be prevalent in the VO, say, is much
greater use in the latter of geographically-distributed
resources. In this situation, it is likely that there
will be replicas of important datasets located at
different sites (and possibly using different indexing
schemes), so it would be highly desirable for there to be
a service which assesses which of the replicas of a given
dataset to use in a particular query, considering the
network access path to it (via, say, a load model, or
firing test queries to it) and, ultimately, knowledge of
how the data are stored - for example, if one replica is
unusual in being indexed on the particular attribute
being tested (e.g. object size, in the astronomical
context) then this might return a result much faster than
other replicas which might be located nearer to the user,
in a network sense.
- p. 18, final paragraph. In addition to restricting access
to provisional data, a key requirement in
astronomy is the restriction of access to data that is proprietary.
- pp. 17-23, Metadata section. It is not
immediately clear where this should go, but an important
concern for astronomy is the availability of metadata
describing data quality. For example, imagine undertaking
a large, optical sky survey over several years. During
the course of that time, observational conditions will
vary considerably, so one is likely to wish to set
criteria for which conditions are considered acceptable
for inclusion of data in the final sky survey data
products. Many nights of data might be excluded under
these criteria, but they might be perfectly acceptable
for some analyses: a typical data quality measure in
astronomy is the "seeing", which is the size
that a point-like image becomes after atmospheric and
instrumental blurring. For some analyses, this will be
crucial, while for others, it may be very unimportant.
So, data quality in astronomy is not a binary
"good" or "bad", and what is needed
in the VO is a means of characterising data so that an
agent performing a query can assess whether a given
dataset is adequate for the particular analysis. This
sort of requirement does not appear to be covered in this
Section.
- pp. 23-25, Provenance section. Not clear what
the significant distinction is between Versioning
and Provenance - isn't the former one way of
expressing the latter?...in which case, do the two need
to be discussed separately?
- p. 28, Data Publishing and Discovery section.
Maybe worth starting with a few words to explain that Publishing
is here being used in the most general sense of "making
available", without any implication that it
means appearance in a journal, book or whatever.
- p. 28, second paragraph of Data Publishing and
Discovery section. It is slightly incorrect to say
that AstroGrid is intending to provide data curation
facilities: the responsibility for curating astronomical
data will continue to reside with the expert data centres
(many of which are represented in the AstroGrid
consortium, of course), and AstroGrid will provide the
means of federating those resources into the VO, as well
as suggestions for "best practice" for
data curation, so as to aid that federation.
- p. 30, third paragraph of Data Discovery
Functionality section. Users should be able to exclude, as
well as include, particular catalogues in a search.
- p. 30, last sentence. This is a very important
requirement, which should be given much more emphasis:
for virtual databases like the VO, it will be crucial
that users can display samples or compute statistical
summaries so that they can decide whether a particular
combination of data is interesting or not.
- p. 32, Specifying the Target section. It might
be good to note here (as discussed later, on p. 33) that
the Grid should be able to judge which is the best
replica to use in any situation.
- pp. 32-33, Specifying the Retrieval Conditions
section. Some of the material covered in this section
touches on the issue of query languages, and that is
surely a sufficiently large topic - especially for
virtual databases with heterogeneous components - as to
deserve more attention here and a section of its
own?...it is addressed to some degree in the sentence
starting "Third, when more..." on p.
33, but it deserves more emphasis.
- p. 34, third paragraph of Data Analysis and
Interpretation section. The final sentence here
constitutes further missing requirements.
- p. 34, next paragraph. Ditto.
- p. 34, final paragraph. This reads like something that is
too ambitious to contemplate as a generic functionality -
recording the download of all data from the Grid,
notifying all Grid users when data they have downloaded
has changed, and providing them with the means of
reconciling their own copy of a dataset with the revised
master copy.
- p. 36, fifth paragraph of Methods of working with
data section. The first sentence of this paragraph
identifies a requirement missing from Appendix III.
- p. 36, next paragraph. Ditto.
- p. 38, fourth paragraph of Data Lifecycle
section. Would be very good to have some quantitative
information here - e.g. for the volume of temporary
storage likely to be required.
- p. 39, end of penultimate paragraph of Data
Publishing and Discovery section. This is a good
example of the conflation of the concepts of
"the Grid" as the sum of the resources
accessible via the network and "the Grid" as
the middleware making that access possible: it is surely
solely the concern of the data centres that offline
copies of patented, invalid data are kept.
- p. 39, next paragraph. The desired meaning of the word
"archive" here is obscure - at least
to an astronomical readership.
- p. 39, first paragraph of Data Management Operations
section. Implicit in the opportunities described in this
paragraph is the requirement that there be a means of
controlling access to data resources on the Grid, so that
the Grid is not brought to a halt by, say, a set of concurrent
database queries each yielding unusably large quantities
of data; but maybe this is an example of a requirement
identified within the context of a data service that is
really just a generic Grid service requirement.
- p. 39, final two sentences of the third paragraph of Data
Management Operations section. This is most
definitely a requirement on data centres, not the
developers or maintainers of Grid middleware.
- p. 41, first table of Appendix I. Affiliation for Robert
Mann should be Institute for Astronomy, University of
Edinburgh.
- p. 41, text following that table. Further details of
AstroGrid will not be found at the EPSRC WWW site, as it
is funded through PPARC: details can be found at www.pparc.ac.uk or on
the project's own WWW site, www.astrogrid.org.
- p. 41, second table of Appendix I. Affiliation for Edwin
Valentijn should be Kapteyn Institute, Groningen.
- p. 52, first Requirement in Data
grouping. The storage of numerical data to highest
precision is a requirement on the data centre; only the
transport of those data to highest precision is really a
Grid requirement.
- p. 52, Modify data content - insert, update, delete
requirement. In a similar vein to the earlier comment, the
only aspect of this which is really "Griddy" is
enabling it to be done remotely, surely?
- p. 52, first three requirements in the Access Control
grouping. As noted before, surely these access control
issues are concerns of the data centre, and all the Grid
needs to do is to make available the authentication
information required to apply such policies for the case
of access by a particular user.
- p. 58, Appendix IV. We make no comment on the
prioritisation until it has been clarified upon whom
each requirement is to be placed, and
until the generation of a complete list of requirements
is made possible by the numbering of sections of text.
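
To make the replica-selection service suggested in the comment on p. 18
more concrete, the following Python sketch chooses between replicas of
a dataset by adding an estimate of the network cost to an estimate of
the query cost, the latter depending on whether the replica is indexed
on the attribute being tested. The class and field names, the additive
cost model and all of the numbers are hypothetical, chosen only for
illustration; a real service would presumably obtain them from a load
model or from test queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    latency_ms: float         # modelled or measured network cost to the site
    indexed_on: frozenset     # attributes for which this replica has an index
    indexed_lookup_ms: float  # assumed query cost when an index can be used
    full_scan_ms: float       # assumed query cost when no index applies


def estimated_cost_ms(replica, attribute):
    """Network cost plus query cost for a constraint on one attribute."""
    if attribute in replica.indexed_on:
        query_ms = replica.indexed_lookup_ms
    else:
        query_ms = replica.full_scan_ms
    return replica.latency_ms + query_ms


def choose_replica(replicas, attribute):
    """Return the replica with the lowest estimated total cost."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: estimated_cost_ms(r, attribute))


# Hypothetical replicas: one near the user, one distant but also
# indexed on object size.
near = Replica("near-site", 5.0, frozenset({"position"}), 200.0, 60000.0)
far = Replica("far-site", 150.0, frozenset({"position", "size"}), 200.0, 60000.0)

print(choose_replica([near, far], "size").site)
print(choose_replica([near, far], "position").site)
```

Under these assumed costs the distant replica wins for a query on
object size, because its index outweighs the longer network path,
while the nearer replica wins for a query on position.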