Tools for Large Astronomical Consortia
Background
Most current VO tools are targeted at individual astronomers using public services. Conversely, the infrastructure software is targeted at data centre staff with technical skills. However, a key part of modern astronomy is in large consortia (50-100 astronomers) who wish to mix new and old datasets - e.g. HERMES, ATLAS, UDS, GOODS, COSMOS. If we can give these people a good service we are on to a real winner.
AstroGrid put some effort in this area in the past, but in fact I think we may now be ready to make real progress.
Some things are easy (VO Explorer tweaks); some are ready but lots of work (VODataPublisher tool); and some are in between (new File Explorer interface - maybe VOSpaceViewer, set up communities).
Many of these ideas emerged while at ESO talking to Evanthia Hatziminaoglou and Jochen Liske, who have pressing needs in this area.
The hardest one first. These consortia have often promised funding agencies to "publish their data in a VO compliant fashion" but (a) don't know how, and (b) are VERY reluctant to put any real effort in this direction. There are two routes. (i) Hosting : CDS, LEDAS, etc say "give us your data and we will set it up". (ii) Some kind of really simple tool : VODataPublisher. For example, the user drags a collection of FITS files onto an icon labelled "make SIAP"; a form pops up that is half filled in from looking at the FITS headers. Another button says "publish to Registry".
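The "half filled in from the FITS headers" step can be sketched in a few lines. This is only an illustration, not a design for VODataPublisher: the form field names (title, ra_deg, needs_user_input, ...) are made up for the example, and the parser handles only simple 80-character "KEYWORD = value" cards, with no continuation cards or embedded quotes.

```python
def _parse_value(text):
    """Convert the value part of a FITS card to a Python value."""
    text = text.strip()
    if text.startswith("'"):
        # String value: take the text between the first pair of quotes.
        # Assumption: no embedded '' escapes (a real parser must handle them).
        end = text.find("'", 1)
        return text[1:end].rstrip() if end > 0 else text[1:].rstrip()
    text = text.split("/", 1)[0].strip()  # drop the trailing comment
    if text in ("T", "F"):
        return text == "T"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_fits_header(raw):
    """Read 80-byte header cards from raw bytes until the END card."""
    header = {}
    for offset in range(0, len(raw), 80):
        card = raw[offset:offset + 80].decode("ascii", errors="replace")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY and blank cards carry no value
        header[keyword] = _parse_value(card[10:])
    return header


def prefill_siap_form(header):
    """Build a partly filled form; the field names here are hypothetical."""
    form = {
        "title": header.get("OBJECT"),
        "instrument": header.get("INSTRUME"),
        "ra_deg": None,
        "dec_deg": None,
        "width_pix": header.get("NAXIS1"),
        "height_pix": header.get("NAXIS2"),
    }
    # CRVAL1/2 give the reference position (not necessarily the image
    # centre); only trust them as RA/Dec when CTYPE1/2 say so.
    ctype1 = str(header.get("CTYPE1", ""))
    ctype2 = str(header.get("CTYPE2", ""))
    if ctype1.startswith("RA--") and ctype2.startswith("DEC-"):
        form["ra_deg"] = header.get("CRVAL1")
        form["dec_deg"] = header.get("CRVAL2")
    # Anything still empty is what the pop-up form must ask the user for.
    form["needs_user_input"] = sorted(k for k, v in form.items() if v is None)
    return form
```

The point of the needs_user_input list is that the uploader only sees the questions the headers could not answer, which is what keeps the effort low.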
In fact I suspect the best option may be to combine these two. We are talking about a tool that hosters provide to uploaders. As well as handling registration etc, consortia will want to update their own stuff, so will need a kind of administration tool.
Based on discussion at the recent interop, VO Space and DSA give us the infrastructure we need, but the tool engineering for this is significant. Six months ? If this is worth doing we would need to decide who is available to do it, and start a proper design.
Public versus private datasets
These consortia often need a mixture of existing (public) and new (private) datasets. Many folk in these consortia don't like using the public versions of datasets. Their instinct is to collect, download, and store everything they need. This is partly conservatism but also for good scientific reasons. (i) The public versions keep changing. When they do a crossmatch, they want to use a specific version, and keep the exact same data set for reference later. (ii) The crossmatch algorithms they want to use are more complex than simple positional matching, for example Bayesian algorithms using multiwavelength SEDs. They want to do this once, using their own carefully written software, save the result, and tell their team "here is the answer - use this catalogue". I have heard consortium leaders say "if team members do their own crossmatch they will get it wrong". (iii) They often want to re-do the photometry, so they want access to the pixels from a survey, not just the source catalogue database.
There is no need to resist these feelings. We must give people what they want - BOTH public and private collections. I think we can do this with slight changes to VO Explorer, and a new look VO Space interface.
Public dataset collection : VO subscription.
Of course individual consortium members can already find and collect whatever they like with VO Explorer. It would be much better however if a chosen consortium member could build a collection on everybody's behalf, make it automatically available, and periodically update it. But we have this ! Noel already set this up with the "New Subscription" item in VO Explorer. However : (i) it is currently completely undocumented, (ii) nobody knows about it, (iii) the UI needs a bit more thought, and (iv) some things don't seem to work.
I made this work locally very easily. I could make a new folder in VO Explorer and export the XML description to a local file. Then "New subscription" allowed me to browse to that file and it appeared. I checked that I could edit this by hand and next time I started VO Desktop it picked up the latest version.
I tried pointing to an http://xxx type URL, the way one can do in VO Desktop preferences to point to a different "examples" list. But this didn't work. Got message "failed to load from blah.. Could not read blah".
Next I tried loading from a file in VO Space. This worked fine, but that's just private to me. We need to make this file available to the consortium but not public. So we really need to implement the idea of setting up a "community" for a specific consortium, and access control lists. Is there already a way to set up an area for "all community" access, along with a private folder per member ? Questions : who runs the VO Space service, and the community service ?
Finally, the (large) tooltip in File Explorer from hovering over the "location" box suggests that one should be able to do this via ftp. I tried with a file on a public ftp server but couldn't make it work.
If we could quickly make the http or ftp methods work, we could have a very quick victory here; but the better long term solution is to make VO Space attractive to consortium members : something like VOSpaceViewer.
- I'd expect the http and ftp methods to work - more testing needed. Maybe the production of new subscriptions (and updating of them) needs to be slicker - a 'publish' or 'update publication' option, rather than exporting the xml and saving manually? Furthermore, as well as sharing subscriptions, maybe annotations to resources should be shared too?
- I think Norman's extensions to VOExplorer might be close to allowing this kind of scoped sharing and collaboration already - bypassing the problem of community and groups in vospace -- NoelWinstanley - 2009-11-20
Private dataset collection : VO Space interface.
There is no need for the consortium to publish everything they make or collect as a public VO service. We should make it easy for consortium members to get at private files, including that one-off best-crossmatch etc. At the moment they would do this with links on a web page that they set up by hand. It could be much better if they could see their private data collection in a common VO Space, partly so they can immediately view files with Topcat or Aladin etc, but also to make it easy for the responsible individual to add new files.
So the key things again are (a) setting up a community so consortium members see common files, preferably with separate individual areas too, and (b) setting up a nice friendly but simple VOSpaceViewer. How ready for this are we ?
Single tables : it is already the case that if you click on a VOTable file, you will be offered "send to Topcat". Simply advertising this would already make a big difference.
Multiple tables, i.e. databases : see below.
Single images : again, clicking on a FITS file already offers "send to Aladin" etc.
Collections of images : At its simplest this could be done by hand. The consortium administrator could give files sensible names and structure them in folders etc, and just let people browse. The key thing is that this would be much easier to update than editing an html file with links by hand. However, maybe we would want a kind of private SIAP service ? A way to have these searchable by RA,Dec ?
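To give a feel for how little is needed for the "searchable by RA,Dec" part, here is a rough sketch of a positional lookup over a hand-maintained index of images. It is not the SIAP protocol (no VOTable response, no service interface); the ImageEntry structure and the idea of storing one centre and one approximate radius per image are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ImageEntry:
    path: str          # location of the file in the shared space
    ra_deg: float      # image centre, degrees
    dec_deg: float
    radius_deg: float  # rough half-diagonal of the footprint


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def find_images(index, ra_deg, dec_deg, search_radius_deg=0.0):
    """Return paths of images whose circular footprint overlaps the search cone,
    nearest centre first."""
    hits = []
    for entry in index:
        sep = angular_separation_deg(ra_deg, dec_deg, entry.ra_deg, entry.dec_deg)
        if sep <= entry.radius_deg + search_radius_deg:
            hits.append((sep, entry.path))
    return [path for _, path in sorted(hits)]
```

Treating each footprint as a circle will return some images that do not actually contain the position, which is probably acceptable for browsing; a real service would test against the true image boundary.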
Single/multiple spectra : same as for images but substitute VOSpec, SPLAT-VO etc.
Private databases
The SAADA tool is designed to take your tables, turn them into a database, give you a simple web interface for querying, and even publish services. It is not trivial, as it involves setting up Tomcat and MySQL, but it seems to work and is very promising. It looks like this is not a functionality we want to duplicate; rather we should be looking at integrating it with DSA etc to make a turnkey solution.
Regardless of which tool makes databases, it occurs to me again that we don't need to dragoon consortia into publishing services. We just need to give them the tools to play with their own. This is where the recent joint experiments with CfA are important. One would like to click on a database and be offered "build query with VOQuery".
Proper multi-table databases : we need to make sure that we can pick up the metadata correctly, so VOQuery (or the existing QB) can show the tables and columns.
Single tables : very often members will want to run a query on the columns of a single table (e.g. a catalogue). It could be that the best way to do this is just to load it into Topcat. Alternatively this could be done at source - especially if it's a big table. Can VOQuery or the existing QB pick up the metadata from the VOTable itself and offer a query build ?
- sure, VOQuery could offer a query build from the metadata in a votable (just fetch the head of it, I guess), but this is of limited use unless the vospace is actually able to RUN the query - this is the missing functionality at the moment. -- NoelWinstanley - 2009-11-20
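As a rough illustration of the "just fetch the head" idea, the sketch below reads the FIELD elements of a VOTable, stops as soon as it reaches DATA, and builds a simple ADQL SELECT from the column names. It says nothing about how VOQuery actually works; the function names are invented for the example, and reserved words used as column names are not handled. It also does not solve the real gap noted above: something still has to run the query.

```python
import io
import re
import xml.etree.ElementTree as ET

_REGULAR_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def read_votable_columns(data):
    """Return column metadata from VOTable bytes without parsing the rows."""
    columns = []
    for _, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
        name = elem.tag.rsplit("}", 1)[-1]  # ignore the XML namespace
        if name == "FIELD":
            columns.append({
                "name": elem.get("name"),
                "datatype": elem.get("datatype"),
                "unit": elem.get("unit"),
                "ucd": elem.get("ucd"),
            })
        elif name == "DATA":
            break  # the metadata is complete; the table rows follow
    return columns


def quote_identifier(name):
    """Use an ADQL delimited identifier when the name is not a regular one."""
    if _REGULAR_IDENTIFIER.match(name):
        return name
    return '"' + name.replace('"', '""') + '"'


def build_adql(table, columns, where=None, top=None):
    """Assemble a SELECT over all columns; 'where' is passed through as given."""
    select = ", ".join(quote_identifier(c["name"]) for c in columns)
    query = "SELECT "
    if top is not None:
        query += f"TOP {int(top)} "
    query += f"{select} FROM {quote_identifier(table)}"
    if where:
        query += f" WHERE {where}"
    return query
```

A query builder would present the column list (with units and UCDs) to the user and fill in the WHERE clause from their choices.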
-- AndyLawrence - 2009-11-18
Noel's Comments
Got it. I think this is what Dave's been talking about for years - creating ad-hoc data services in vospace
- services are discovered by browsing the vospace, rather than browsing the registry
- vospace takes care of preventing unwanted accesses
- vospace provides a way to update / alter the data
- vospace provides a way of describing the additional capabilities (e.g run an ADQL query, perform a SIAP query) available on each file / folder.
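A toy model of the list above may help pin down what is being asked for. Everything in it is hypothetical (the Node structure, the group names, the capability labels) and it ignores the real VOSpace protocol entirely; it only shows discovery by browsing, with access control and per-node capabilities.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    readers: frozenset = frozenset()  # groups allowed to see this node
    capabilities: tuple = ()          # e.g. ("adql",) or ("siap",)
    children: list = field(default_factory=list)


def discover(node, groups, path=""):
    """Yield (path, capabilities) for every node visible to the given groups.

    A folder the user cannot read hides everything beneath it.
    """
    if not (node.readers & groups):
        return
    here = f"{path}/{node.name}"
    yield here, node.capabilities
    for child in node.children:
        yield from discover(child, groups, here)
```

A client such as File Explorer would then only need to show the extra actions ("run ADQL query", "search images by position") next to the nodes that advertise them.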
Sounds lovely. It'll be great when it's done, but it's asking a lot of VOSpace, isn't it! There are minimal changes required to the FileExplorer UI (just detecting and displaying the additional capabilities), but lots of stuff added to the back end vospace server. There's also the issue of whether the vospace protocol supports all that's needed to achieve this (yet) too.
If all this is achieved for ad-hoc private services, then the same approach might as well be used to publish the public-facing data services of the consortium too.
I've sketched out some ideas as to how this enhanced vospace could be achieved here - VoSpaceOs
Keith's Comments
It's important to remember where these requirements came from. The latest incarnation was a talk given by Evanthia at the Interop, but it is a recurring theme: How do resource-strapped data centres publish their data in the VO? The issue is multi-faceted but boils down to two things:
- Publishing data in the VO currently requires a skill-set that encompasses both technical VO and domain specific understanding of the data to be published
- We (the IVOA) do not provide publisher level documentation
Taking these one at a time, even using good publishing tools like DSA and DALToolkit requires a detailed understanding of both technology and data to work well. Without Publisher-level documentation (I know we produce sys admin docs for our services - I'm talking about "The Complete Dummies Guide To Publishing Data in the VO" i.e. where the hell do you start??) all data centres have to fall back on are the IVOA Protocol specs and they run screaming - and rightly so. This was the experience Evanthia described when trying to ensure the Herschel data are VO-enabled from the outset.
The IVOA must come to understand that there is no "us and them", "astro and techie" divide, there are just customers and consumers of IVOA defined services - services conceived and delivered by the combined efforts of members of the IVOA, technical and scientific together. It is my contention that the customers of the VO are the data centres and service providers whose published data and services are consumed by astronomers. In fact, if the astronomers even know they are using VO, we've exposed too much technology at the sharp end. And no, I do not have a quick answer for the political implications of being so successful we are invisible; but that has to remain the long term goal if VO is to realize its potential.
So, off the soapbox.
The suggestions contained in VoSpaceOs are right on the money (otherwise I would estimate 12 months hard graft to add TAP to VOSpace) - add-on capability (scripting etc) is the non-prescriptive, most efficient way forward. And yes, this is ServerAR! - it's what I had in mind all along; making it as easy to provide server-side functionality as it is client-side. That said, it is not enough simply to provide the capability - we (the IVOA) also need to provide functionality and the documentation that meets the basic needs of data and service providers out of the box - without recourse to Protocol Standards and hard-core programming.
Personally, I think we should target a few high value publishers and solve their specific problems using existing VO tools + whatever glue is needed to make it all work. I'd go further and say we should actively seek out some far reaching research and help that in the same way and make a name for VO as the only way to do research of type XYZ - but that isn't a popular view.
-- KeithNoddle - 2009-11-23
P.S. I don't see why services cannot continue to be discovered via Registry; interfaces bolted onto VOSpace are surely registrable as separate entities?
Guy's Comments
In AutomatedDataPublishingViaVoSpace I discuss the registration issues of data services associated with VOSpace.
Maybe I'm dim, but I don't see how a scripting interface to VOSpace makes it easier to associate VOSpace with a DAL interface. I can see that some things, like dumping the FITS headers from an image file, can be well done with such a script; but those things are explicitly not DAL interfaces. DAL stuff is several orders of magnitude more complicated. As an extreme case, nobody should expect to re-implement DSA/Catalogue as a Groovy script.
I think we have to distinguish two sets of use cases:
- non-standard hacks;
- standards-compliant installations.
The VOSpace-scripting leads naturally, IMHO, to a big pile of non-standard hacks. This is, potentially, really useful within a collaboration as a way of sharing specific data, privately. However, it's not "publishing data in the VO" in any form that interoperates. To consume data exposed in this way, you'd need clients specialized for each set of scripts -- taking us right back to 2001.
Further, data exposure via scripting can easily be a displacement activity for real publishing. Make it easy to expose data in a non-standard way and providers will just do that and nothing more. Therefore, I would much rather work on real, publishing tools than scripting.
It's important to remember that the core of the publishing is the description of data and the consumption of that description by the DAL software.
If we sort this out, the same solution could be used for publishing through VOSpace, or through a custom interface.
-- GuyRixon - 2009-11-23