Suggested syntax for Grid Data Handles

Presume that we have user data scattered around MySpace on various servers. Presume also that the data are grouped into logical files such that each logical file can be copied to a real, physical file on some computing system. Allow that there may be more than one copy of the data of each logical file in the data-grid.

An application then needs two ways of naming a file. It needs to be able to say, naming a logical file at unknown locations in the data-grid, "I want one of those". It needs to be able to say, naming a specific location in the data-grid, "give me that one" or "put this over there". Following the terminology of the OGSA specification, we could call these two names Grid Data Handle for the abstract one and Grid Data Reference for the concrete one.

It seems clear that a Grid Data Reference ought to be a URL. URLs have the right semantics, identifying one logical file at one location. It would be perverse to invent something different.

It seems plausible that the Grid Data Handle should be a URI, since we have software for parsing URIs in most environments. The GridDataHandle shouldn't be a URL, since it's independent of location. It could be a URI that looks like a URL; e.g. we could use the syntax of an HTTP URL and use the address part to denote namespace. This is already common practice for denoting XML namespaces. However, using something that looks like a URL but doesn't work like a URL is confusing for everybody.

Uniform Resource Names (URNs) may be what we need. This description of them is taken from [http://www.dlib.org/dlib/february96/02arms.html][D-Lib Magazine, February 1996].

RFC 1737 lays out functional requirements for URNs. It also makes recommendations about the form that such names might take. An updated version of RFC 1737 is under discussion, but, with some important clarifications, the following list of requirements has been widely accepted.

"Global scope: A URN is a name with global scope which does not imply a location. It has the same meaning everywhere.

"Global uniqueness: The same URN will never be assigned to two different resources.

"Persistence: It is intended that the lifetime of a URN be permanent. That is, the URN will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name.

"Scalability: URNs can be assigned to any resource that might conceivably be available on the network, for hundreds of years.

"Legacy support: The scheme must permit the support of existing legacy naming systems, insofar as they satisfy the other requirements described here. ...

"Extensibility: Any scheme for URNs must permit future extensions to the scheme.

"Independence: It is solely the responsibility of a name issuing authority to determine the conditions under which it will issue a name."

URN syntax is described by RFC2141. Basically, a URN starts with "urn:", followed by a string stating a namespace, followed by a colon, followed by an identifier for the logical file. This is a valid URN:

urn:my-namespace:my-test-file
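
As a rough illustration of that structure, here is a minimal Python sketch that splits a URN into its namespace and identifier. The function name `split_urn` and the pattern are my own assumptions for illustration; the pattern is a simplified reading of RFC 2141 (it checks the namespace part but does not validate which characters appear in the identifier).

```python
import re

# Simplified reading of RFC 2141: "urn:" <namespace> ":" <identifier>.
# The namespace is 1-32 letters, digits or hyphens and may not start
# with a hyphen. The identifier is NOT validated here (assumption).
URN_PATTERN = re.compile(
    r"^urn:(?P<nid>[A-Za-z0-9][A-Za-z0-9-]{0,31}):(?P<nss>.+)$",
    re.IGNORECASE,
)


def split_urn(urn):
    """Return (namespace, identifier) for a URN, or raise ValueError."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    # The "urn:" prefix and the namespace are case-insensitive, so the
    # namespace is normalised to lower case; the identifier is left alone.
    return match.group("nid").lower(), match.group("nss")


print(split_urn("urn:my-namespace:my-test-file"))
# prints ('my-namespace', 'my-test-file')
```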

What should the name-space part be? I suggest that it should identify the entire MySpace network of interoperable servers. This might encompass the whole global VO, or there might be separate networks in various parts of the VO. Therefore,

urn:ag-myspace:...

seems good for now, and we can make it just

urn:myspace:...

later if need be.

What should the file-identifier be? This depends on how we expect the internal structure of MySpace to appear to users and programmes looking in. If we agree, as suggested above, that a URL can refer to specific locations in MySpace, then we have a Unix-file-system-like syntax built into the URL. These URLs could all refer to the same file in MySpace:

http://cass123.ast.cam.ac.uk:8080/AGMySpace/gtr/my-directory/my-file.dat
gsiftp://cass123.ast.cam.ac.uk/AGMySpace/gtr/my-directory/my-file.dat
file:/AGMySpace/gtr/my-directory/my-file.dat

I.e., there is a single file-system hierarchy rooted at AGMySpace that can be exposed on the net or locally through various protocols.
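
To make the "one hierarchy, several protocols" idea concrete, here is a small Python sketch that builds the three kinds of URL from one MySpace path. The function name `references_for` and the default HTTP port are assumptions chosen to reproduce the example URLs above; nothing here is an agreed MySpace interface.

```python
def references_for(path, host, http_port=8080):
    """Build URLs exposing one MySpace path through several protocols.

    Hypothetical helper: the protocol list and port are taken from the
    example URLs in the text, not from any agreed specification.
    """
    if not path.startswith("/AGMySpace/"):
        raise ValueError(f"not under the MySpace root: {path!r}")
    return {
        "http": f"http://{host}:{http_port}{path}",
        "gsiftp": f"gsiftp://{host}{path}",
        # The local form carries no host: it only makes sense on the server.
        "file": f"file:{path}",
    }


refs = references_for(
    "/AGMySpace/gtr/my-directory/my-file.dat", "cass123.ast.cam.ac.uk"
)
for url in refs.values():
    print(url)
```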

I suggest that we agree a single directory hierarchy for MySpace and dictate that every MySpace server in AstroGrid's data-grid supports the same hierarchy. The identifier for a file is then its position in the hierarchy. Note that not all MySpace servers will have physical copies of all the branches of the hierarchy, but they should all understand where to put a file given a URN.

Therefore, these would be possible Grid Data Handles for things in AstroGrid's MySpace.

urn:ag-myspace:/AGMySpace/private/gtr/my-directory/my-file.xml
urn:ag-myspace:/AGMySpace/published/directory/file.xml

I'm supposing that everything starts at "AGMySpace" so that there is an obvious root for the tree in a physical, local file-system on a server. I'm guessing that there might be two areas, one for private files and one for published data, but this is not a requirement.
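
Here is a sketch, in Python, of how a server might work out where to put a file given such a handle. The storage root `/data`, the prefix constant and the function name `local_path_for` are all hypothetical; the only rule taken from the text is that the identifier is a path rooted at AGMySpace.

```python
from pathlib import PurePosixPath

HANDLE_PREFIX = "urn:ag-myspace:"


def local_path_for(handle, storage_root="/data"):
    """Map a Grid Data Handle to a file path under a server's storage root.

    storage_root is a made-up example; each server would choose its own.
    """
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not an AstroGrid MySpace handle: {handle!r}")
    parts = PurePosixPath(handle[len(HANDLE_PREFIX):]).parts
    # Require an absolute path rooted at AGMySpace, and refuse ".."
    # segments so that a handle cannot point outside the hierarchy.
    if parts[:2] != ("/", "AGMySpace") or ".." in parts:
        raise ValueError(f"identifier is not under /AGMySpace: {handle!r}")
    # Drop the leading "/" and graft the rest onto the storage root.
    return PurePosixPath(storage_root).joinpath(*parts[1:])


print(local_path_for("urn:ag-myspace:/AGMySpace/published/directory/file.xml"))
# prints /data/AGMySpace/published/directory/file.xml
```

Because every server applies the same rule, a server that has no copy of a branch can still tell where a new copy belongs.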

There is one problem with this. Technically, the URN standard syntax doesn't allow forward slashes in identifiers. If we use normal file-tree syntax then our URNs aren't interoperable with other putative URN systems. However, since we don't expect MySpace URNs to be meaningful to systems other than MySpace, this may be a reasonable compromise. The alternative is to replace each slash with its ASCII hex-code, i.e. the percent-encoded form %2F. This would make our URNs standard but harder to read and to type.
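
To show what the encoded alternative would look like, here is a short Python sketch using the standard library's `urllib.parse.quote` and `unquote`. The helper names `encode_handle` and `decode_handle` are assumptions for illustration only.

```python
from urllib.parse import quote, unquote

HANDLE_PREFIX = "urn:ag-myspace:"


def encode_handle(path):
    """Build a handle with every slash in the path percent-encoded."""
    # safe="" makes quote() encode "/" as %2F instead of leaving it alone.
    return HANDLE_PREFIX + quote(path, safe="")


def decode_handle(handle):
    """Recover the file-tree path from a percent-encoded handle."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not an AstroGrid MySpace handle: {handle!r}")
    return unquote(handle[len(HANDLE_PREFIX):])


encoded = encode_handle("/AGMySpace/published/directory/file.xml")
print(encoded)
# prints urn:ag-myspace:%2FAGMySpace%2Fpublished%2Fdirectory%2Ffile.xml
print(decode_handle(encoded))
# prints /AGMySpace/published/directory/file.xml
```

The round trip is mechanical, so software could accept the encoded form on the wire while still showing users the slash form.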

-- GuyRixon - 20 Sep 2002

Topic revision: r3 - 2002-09-20 - 13:02:27 - GuyRixon
 