r2 - 27 Nov 2006 - 15:15:08 - DaveMorris

VOSpace views, formats and protocols

This started as a reply to Paul's email, but it is probably a bit too big to be readable in an email. I'll post a short reply to the list and put the rest of the text here.

Paul Harrison wrote:
> On 23.11.2006, at 05:16, Matthew Graham wrote:
> 
> >
> > 2. Views
> > This needs to be renamed to what it actually is, i.e. format(s), since
> > the current name is universally confusing.
> 
> yes! - but I think it is really two concepts - on import it is just a  
> statement of the format of the data - on export it is a request to  
> convert the data to a format. The whole transfer object in the WSDL  
> should reflect this difference by not using the same data structure  
> for requests and returned information, as this is also confusing IMHO  
> as the distinction is lost between the statements
> "this data is in format x"
> "convert this data to format y"
> 

Yep, I agree with that. I suspect that we ended up with 'view' in the import methods because of history.

This originally came from the work we did at Cambridge in June, where we started looking at different back end implementations and what they would want to offer. In particular, we were thinking about a relational database service that could be used to store tabular data.

A simple file based system always has the option of just returning the original file, but a database system can't. Once the data has been transferred into a table, the system no longer has the original file.

So, a database system would always have to generate the output from the data in the table. Which is where the name view came from.

As far as I can remember, the same element name was used for the import messages to make it possible to write a simple non-parsing client that could take the XML from an export response and paste it into an import request to another server without having to parse and modify the XML elements.

In hindsight, this probably isn't a very important use case, as all of the client implementations will probably use a SOAP library to parse the XML into native objects before they even begin to look at it, and generate their requests as object trees first and use the SOAP library to serialize them into XML at the end.

It isn't worth the confusion that importing a view has caused, both within our own group and when explaining it to others.

As Paul says, importing and exporting are two different cases, and they have different meanings.

On an import, we simply want to say "I want to send a FITS image". The fact that it is the result of a cut-out transform of a larger image stored elsewhere is irrelevant. That information may be useful for provenance, in which case it should be encoded elsewhere, perhaps as properties of the node.

Importing data

For importing data into the VOSpace service, we should use a <format> element to denote the file format of the data being sent.

So, a VOSpace node would list the data formats that it could accept :

    <node uri="vos://....">
        ....
        <accepts>
            <format uri="...."/>
            <format uri="...."/>
            <format uri="...."/>
        </accepts>
        ....
    </node>

and a client would specify the <format> that it wanted to send :

    <pushToVOSpace>
        ....
        <format uri="...."/>
        ....
    </pushToVOSpace>
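
As a rough sketch of how a client might use the <accepts> list before sending a push request, the Python below parses a node description and checks whether a chosen format is accepted. The format URIs ([fits-image], [votable], [jpeg]) are placeholders in the style used on this page, and accepted_formats and can_send are hypothetical helper names, not part of any VOSpace client library.

```python
import xml.etree.ElementTree as ET

# Hypothetical node description; the URIs are placeholders, not real identifiers.
NODE_XML = """
<node uri="vos://example">
    <accepts>
        <format uri="[fits-image]"/>
        <format uri="[votable]"/>
    </accepts>
</node>
"""


def accepted_formats(node_xml):
    """Return the format URIs listed under <accepts>, in document order."""
    node = ET.fromstring(node_xml)
    return [fmt.get("uri") for fmt in node.findall("./accepts/format")]


def can_send(node_xml, format_uri):
    """True if the node says it accepts data in the given format."""
    return format_uri in accepted_formats(node_xml)


if __name__ == "__main__":
    print(accepted_formats(NODE_XML))
    print(can_send(NODE_XML, "[fits-image]"))
    print(can_send(NODE_XML, "[jpeg]"))
```

A client would run a check like this first, and only build the <pushToVOSpace> request if the format is listed.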

Exporting data

Exporting data is more complex, because we have several things to take into account.

The simple case of the file store can just offer the original file, in whatever format it was originally imported, and we should take care to ensure that the specification makes this simple to implement.

However, we have all talked about ideas for systems that can apply transforms to and offer alternative views of the data.

  • A database based service has to generate views of the data, because it does not have the original file.
  • An image archive may offer a cutout service, or it may want to provide other transforms on the stored data.

At the moment what we have is :

    <node uri="vos://....">
        ....
        <provides>
            <view uri="...."/>
            <view uri="...."/>
            <view uri="...."/>
        </provides>
        ....
    </node>

just using <view> elements to indicate what data is available.

This mangles about three different things into one element.

  • What view of the data is being offered (the original image, a cutout fragment, or another transform).
  • What format it is being offered in (FITS-image, jpeg, png or even 'ZIP containing a FITS containing an image')
  • What protocols can be used to access the data (this is missing in the current spec.).

Views

The current XML schema for <view> with <param> elements works for the first part. So, an image archive might offer the original data as-is :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- The original image -->
            <view uri="...." original="true">
            </view>
            ....
        </provides>
        ....
    </node>

Or, a part of the image, with params to specify the coordinates of the fragment you want :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- A cut-out of the image -->
            <view uri="....">
                <param name="...."/>
                <param name="...."/>
            </view>
            ....
        </provides>
        ....
    </node>

Formats

Next, we need to express the formats that these images are available in.

In the above example I have called the first <view> the original image. However, a service may also want to offer simple format conversions, providing the same image data in different formats.

Which would mean that the same image may be available in a variety of formats, only one of which is the original :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- The original image -->
            <view uri="....">
                <formats>
                    <format uri="...." original="true"/>
                    <format uri="...."/>
                    <format uri="...."/>
                </formats>
            </view>
            ....
        </provides>
        ....
    </node>

Note that the original flag moves from the view to one (and only one) of the format elements.

The service could also offer the results of the cut-out service in different formats, but none of them should be tagged as original :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- A cutout of the image -->
            <view uri="....">
                <params>
                    <param name="...."/>
                    <param name="...."/>
                </params>
                <formats>
                    <format uri="...."/>
                    <format uri="...."/>
                </formats>
            </view>
            ....
        </provides>
        ....
    </node>

It might be quite difficult to express the rules for the original attribute clearly enough in the XML schema. It should only apply to one format element, in one of the available views.

It might be clearer to provide the original data in a separate <view> element, and use a specific URI to represent original data, rather than using a separate attribute.

This means we don't have to have extra rules about where the original attribute can be set and where it can't.

Note - In the current schema we have the original flag as an optional (default true) attribute on any view element. So all of the available views may appear to be 'original', apart from the ones that we remember to explicitly say are not.

Replacing the original attribute with the special URI might make it easier to express the rules in the schema.

  • A <view> element with the special original data URI may only contain one <format> element.
  • A <view> element with any other URI can contain many <format> elements.

You can't create a view that appears to declare generated data as an original, because the view URI describes the transform - setting the view URI to anything other than original data means it isn't original.
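
The two rules above are simple enough to check in a few lines of code, even if they turn out to be awkward to express in the schema itself. The sketch below is only an illustration: [original] stands in for whatever special URI is eventually chosen, the other URIs are placeholders, and check_views is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ORIGINAL_URI = "[original]"  # placeholder for the special 'original data' URI

# Hypothetical node descriptions; all URIs are placeholders.
GOOD = """
<node uri="vos://example">
    <provides>
        <view uri="[original]">
            <formats><format uri="[fits-image]"/></formats>
        </view>
        <view uri="[format-transform]">
            <formats><format uri="[jpeg]"/><format uri="[png]"/></formats>
        </view>
    </provides>
</node>
"""

BAD = """
<node uri="vos://example">
    <provides>
        <view uri="[original]">
            <formats><format uri="[fits-image]"/><format uri="[jpeg]"/></formats>
        </view>
    </provides>
</node>
"""


def check_views(node_xml):
    """Return a list of messages, one per original view that breaks the rule."""
    node = ET.fromstring(node_xml)
    errors = []
    for view in node.findall("./provides/view"):
        if view.get("uri") != ORIGINAL_URI:
            continue  # any other view may contain many <format> elements
        count = len(list(view.iter("format")))
        if count != 1:
            errors.append("original view has %d formats, expected 1" % count)
    return errors


if __name__ == "__main__":
    print(check_views(GOOD))
    print(check_views(BAD))
```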

So, the image archive could offer the original image file :

    <node uri="vos://....">
        ....
        <provides>
            <!-- The original image -->
            <view uri="[original]">
                <formats>
                    <format uri="...."/>
                </formats>
            </view>
            ....
        </provides>
        ....
    </node>

and the generated formats in a separate view :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- The whole image, in different formats -->
            <view uri="[format-transform]">
                <formats>
                    <format uri="...."/>
                    <format uri="...."/>
                </formats>
            </view>
            ....
        </provides>
        ....
    </node>

Plus a cut-out service that provides image fragments in a variety of formats :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- A cut-out of the image -->
            <view uri="[cut-out]">
                <params>
                    <param name="...."/>
                    <param name="...."/>
                </params>
                <formats>
                    <format uri="...."/>
                    <format uri="...."/>
                </formats>
            </view>
            ....
        </provides>
        ....
    </node>

Separating view and format into two elements means that the <view> element just describes the transform applied to the data, rather than the file format that it is in.

Protocols

Next we need to look at the available protocols, and how they match up with the views. In the current specification, the protocols are listed on a per-service basis, which implies that all of the protocols are available for all of the views.

This actually makes it very difficult to implement a flexible service in a reliable way.

As most of us are working in Java, the easiest way to implement something like the image transform or cut-out services would be to use a Java servlet located within the same webapp as the VOSpace service.

This means that the more complex views would only be available via http-get.

If I also wanted to use a GridFTP server to provide access to the data, I would have to do some nasty mangly stuff in the background to make the output from the Java servlet available via the GridFTP server.

As it stands, implementation choices are :

  1. Support a variety of export protocols, but only provide access to the original files.
  2. Provide a range of views and transforms, but only support http-get from a Servlet.
  3. Modify the GridFTP server to integrate it into the VOSpace service.
  4. Provide a variety of views and protocols, but fail unexpectedly for some combinations.

I suspect that the most common option will be (4). People will have a loose coupling between a GridFTP server and their VOSpace service, offering the original files. Plus, they will add one or more Java servlets to the VOSpace service offering a variety of transforms.

The service metadata will say that everything is available via both GridFTP and http and the individual nodes will offer the original file alongside a variety of views. However, request one of the transformed views via GridFTP, and the service will have to reject the request because it isn't available using that protocol.

In the current system, there is no way for the client to find out beforehand which views are available through which protocols.

The solution to this is to add the list of <protocols> inside the <format> element, making them specific to the selected node, view and format.

The image archive could provide access to the original image via a number of different protocols, using standard http, ftp and GridFTP servers :

    <node uri="vos://....">
        ....
        <provides>
            <!-- The original image -->
            <view uri="[original]">
                <format uri="....">
                    <protocol uri="[ftp-get]"/>
                    <protocol uri="[gftp-get]"/>
                    <protocol uri="[http-get]"/>
                </format>
            </view>
            ....
        </provides>
        ....
    </node>

Plus a number of views which are only available via http-get, using Java Servlets within the VOSpace service to provide the transforms :

    <node uri="vos://....">
        ....
        <provides>
            ....
            <!-- A cut-out of the image -->
            <view uri="[cut-out]">
                <params>
                    <param name="...."/>
                    <param name="...."/>
                </params>
                <formats>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                </formats>
            </view>

        </provides>
        ....
    </node>
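
With the protocols nested inside each format, a client can find out beforehand whether a given combination will work. The following is a minimal sketch, assuming the nesting shown in the two examples above; the format URI [fits-image] is a placeholder and protocols_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical node description combining the two examples above.
NODE_XML = """
<node uri="vos://example">
    <provides>
        <view uri="[original]">
            <format uri="[fits-image]">
                <protocol uri="[ftp-get]"/>
                <protocol uri="[gftp-get]"/>
                <protocol uri="[http-get]"/>
            </format>
        </view>
        <view uri="[cut-out]">
            <formats>
                <format uri="[fits-image]">
                    <protocol uri="[http-get]"/>
                </format>
            </formats>
        </view>
    </provides>
</node>
"""


def protocols_for(node_xml, view_uri, format_uri):
    """Return the protocol URIs offered for one view and format, or []."""
    node = ET.fromstring(node_xml)
    for view in node.findall("./provides/view"):
        if view.get("uri") != view_uri:
            continue
        # iter() finds <format> whether or not it is wrapped in <formats>.
        for fmt in view.iter("format"):
            if fmt.get("uri") == format_uri:
                return [p.get("uri") for p in fmt.findall("protocol")]
    return []


if __name__ == "__main__":
    print(protocols_for(NODE_XML, "[original]", "[fits-image]"))
    print(protocols_for(NODE_XML, "[cut-out]", "[fits-image]"))
```

A client that wants the cut-out via GridFTP can see from the empty match that it should not even try, rather than having the request rejected by the service.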

Simple file store

Putting this all together, we can represent a simple file store that just provides the original data unmodified :

    <node uri="vos://....">
        ....
        <provides>
            <!-- The original file -->
            <view uri="[original]">
                <format uri="....">
                    <protocol uri="[ftp-get]"/>
                    <protocol uri="[gftp-get]"/>
                    <protocol uri="[http-get]"/>
                </format>
            </view>
        </provides>
        ....
    </node>

Complex image handler

And, we can represent a complex image handler that can also offer format conversions and image cut-outs :

    <node uri="vos://....">
        ....
        <provides>
            <!-- The original image -->
            <view uri="[original]">
                <format uri="....">
                    <protocol uri="[ftp-get]"/>
                    <protocol uri="[gftp-get]"/>
                    <protocol uri="[http-get]"/>
                </format>
            </view>
            ....
            <!-- The same image in different formats -->
            <view uri="[format-transform]">
                <formats>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                </formats>
            </view>
            ....
            <!-- A cut-out of the image -->
            <view uri="[cut-out]">
                <params>
                    <param name="...."/>
                    <param name="...."/>
                </params>
                <formats>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                    <format uri="....">
                        <protocol uri="[http-get]"/>
                    </format>
                </formats>
            </view>
        </provides>
        ....
    </node>

Importing data

The same pattern, of placing the <protocol> elements within the <format> elements, can be used on the import side as well (without the views).

So, a file based service may accept the generic any format via a number of protocols :

    <node uri="vos://....">
        ....
        <accepts>
            <!-- The 'any' format -->
            <format uri="[any]">
                <protocol uri="[ftp-get]"/>
                <protocol uri="[gftp-get]"/>
                <protocol uri="[http-get]"/>
            </format>
        </accepts>
    </node>

A more specialised image archive or database system may only accept specific formats sent via http-put :

    <node uri="vos://....">
        ....
        <accepts>
            <!-- The any format -->
            <format uri="[any]">
                <protocol uri="[ftp-get]"/>
                <protocol uri="[gftp-get]"/>
                <protocol uri="[http-get]"/>
                <protocol uri="[http-put]"/>
            </format>

            <!-- VOTable data -->
            <format uri="[votable]">
                <protocol uri="[http-put]"/>
            </format>

            <!-- FITS images data -->
            <format uri="[fits-image]">
                <protocol uri="[http-put]"/>
            </format>

        </accepts>
    </node>
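
On the client side, choosing how to send a file then becomes a lookup in the <accepts> list. The sketch below makes one assumption that is not stated above: if the specific format is not listed, the client falls back to the protocols listed for the [any] format. All URIs are placeholders and import_protocols is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ANY_URI = "[any]"  # placeholder for the generic 'any' format URI

# Hypothetical node description; all URIs are placeholders.
NODE_XML = """
<node uri="vos://example">
    <accepts>
        <format uri="[any]">
            <protocol uri="[ftp-get]"/>
            <protocol uri="[http-put]"/>
        </format>
        <format uri="[votable]">
            <protocol uri="[http-put]"/>
        </format>
    </accepts>
</node>
"""


def import_protocols(node_xml, format_uri):
    """Return the protocols usable to import data in the given format."""
    node = ET.fromstring(node_xml)
    by_format = {
        fmt.get("uri"): [p.get("uri") for p in fmt.findall("protocol")]
        for fmt in node.findall("./accepts/format")
    }
    if format_uri in by_format:
        return by_format[format_uri]
    # Assumed fallback: treat the data as an opaque file of 'any' format.
    return by_format.get(ANY_URI, [])


if __name__ == "__main__":
    print(import_protocols(NODE_XML, "[votable]"))
    print(import_protocols(NODE_XML, "[jpeg]"))
```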

Summary

I'm sure there are several variations on the XML syntax - places where we could remove elements because their meaning is implicit, or places where we could add elements to make things clearer.

However, the core idea is nesting the elements in the right sequence to express the capabilities offered by the service.

  1. What views of the data are available
  2. What formats are they available in
  3. What protocols can I use to get them

  • views
    • formats
      • protocols
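
One way to see what the nesting buys us is to flatten it: every path from a view, through a format, down to a protocol is one concrete thing the service has promised to deliver. The sketch below walks the <provides> list in that order. The URIs are placeholders and capabilities is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical node description; all URIs are placeholders.
NODE_XML = """
<node uri="vos://example">
    <provides>
        <view uri="[original]">
            <format uri="[fits-image]">
                <protocol uri="[gftp-get]"/>
                <protocol uri="[http-get]"/>
            </format>
        </view>
        <view uri="[format-transform]">
            <formats>
                <format uri="[jpeg]">
                    <protocol uri="[http-get]"/>
                </format>
            </formats>
        </view>
    </provides>
</node>
"""


def capabilities(node_xml):
    """Return (view, format, protocol) tuples, in document order."""
    node = ET.fromstring(node_xml)
    triples = []
    for view in node.findall("./provides/view"):        # 1. views
        for fmt in view.iter("format"):                 # 2. formats
            for proto in fmt.findall("protocol"):       # 3. protocols
                triples.append(
                    (view.get("uri"), fmt.get("uri"), proto.get("uri"))
                )
    return triples


if __name__ == "__main__":
    for triple in capabilities(NODE_XML):
        print(triple)
```

Any combination that does not appear in the resulting list is one the client should not request.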

XML complexity

Earlier in the email discussion that sparked this wiki page, I said that I'd prefer not to have large blocks of text and XML in the node properties, because I'd like to keep the nodes small and simple. Here, I'm arguing the case for adding yet more complexity to the node schema.

The difference is in how often this data gets transferred. In a large file system with many thousands of nodes, most nodes will only ever appear in the ListNodes responses. By default, ListNodes only returns the node properties, and does not include the <accepts> and <provides> lists.

Initially, all the user wants to know is what files are in the directory. They only want to look at the details on one or two selected files within that list. At which point the UI can call GetNode or GetDetails to fetch the additional information.

We could even split this into separate methods. GetProperties gets the full set of properties for a node, including the complex text and XML properties. We could even move the <accepts> and <provides> information into separate GetAccepts and GetProvides methods.

These would only be used once a user has chosen a specific node and wants to do something with it. At which point, the added complexity is needed to represent all of the options that are available.

However, once the user knows what a particular service can do, then they won't even need to request the details.

If I already know that a particular VOSpace is capable of performing cut-out transformations on FITS images, then I can write a simple little Python program that uses the ACR to get a list of FITS images from somewhere, drops them into a container in the VOSpace and requests a cut-out view of each of them.

Instant drag/drop image processing workflow, no hard lifting required.

Well .... ok, that is the theory anyway :-)
