Streaming Registry Client.
Implementation notes and feedback about the ag registry service
Noel Winstanley
XFire
I've used XFire ( http://xfire.codehaus.org/ ) to implement a streaming registry client within AR. It's able to handle much larger results than the previous client - there are no out-of-memory errors like there were with the previous (Axis-based) registry client. Implementing a client in XFire was a bit tricky - most of the examples are for the server side of SOAP.
Parsing of Query Results
The reg client returns an array of Resource objects - one object for each registry entry. These objects (in the org.astrogrid.acr.ivoa.resource package in the desktop/api project) are based on the new 1.0 schema, and contain all the data within the registry xml documents. These objects are parsed from the xml stream returned from the service - there's no buffering of result documents before the parsing happens.
Although the objects have been designed to represent the new v1.0 schema, the parser currently works over the old v0.10 registry schema. When we switch registry schema versions, the parser that creates the data objects will need adjusting - but the objects (and code that uses them) will be unchanged.
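The streaming parse described above can be sketched with the JDK's own pull parser (StAX, javax.xml.stream). This is only an illustration of the idea, not the AR parser: the class name, the callback, and the element names Resource and identifier are assumptions made for the example, and a real parser would build full Resource objects rather than strings.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative only: hands each identifier to a callback as soon as it is parsed. */
final class StreamingResourceSketch {

    /** Reads the stream once, front to back; no DOM of the whole response is built. */
    static void parse(Reader in, Consumer<String> onResource) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            boolean inResource = false;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("Resource".equals(name)) {
                        inResource = true;
                    } else if (inResource && "identifier".equals(name)) {
                        // getElementText() reads up to the matching end tag.
                        onResource.accept(xml.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "Resource".equals(xml.getLocalName())) {
                    inResource = false;
                }
            }
        } finally {
            xml.close();
        }
    }

    /** Convenience wrapper: collects the identifiers of an in-memory document. */
    static List<String> identifiers(String document) {
        List<String> ids = new ArrayList<>();
        try {
            parse(new StringReader(document), ids::add);
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return ids;
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<results>"
                + "<Resource><identifier>ivo://example/one</identifier></Resource>"
                + "<Resource><identifier>ivo://example/two</identifier></Resource>"
                + "</results>";
        // The callback fires per record, so a UI could display each one immediately.
        parse(new StringReader(doc), id -> System.out.println("parsed " + id));
    }
}
```

Because the callback runs while later records are still unread, the caller can display early results before the response has finished arriving.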
New Registry Schema
I've based the design of these Resource objects on the example v1.0 registry entries in Kevin's test v1.0 registry at http://msslxt.mssl.ucl.ac.uk:8080/astrogrid-regtest-1_0/browse.jsp .
Could someone tell me where these examples come from? Are the siap and cone entries based on a finalized schema, or are they just made-up examples? Either way, hopefully they won't change too much - they seem quite sensible to me.
One exception to this was the registry entry for a CEA server - which seems incorrect to me. The other service types add additional data (such as type of siap service) within the capability element (extending this to another xsi:type). Within the capability, there may be one or more interface elements that define the different ways of accessing the service (each interface may represent a different protocol version, or maybe a different aspect of the service).
However, the CEA server extends the interface element to list managedApplications. I think this information belongs within the capability, rather than within a single interface of that capability - it would then fit in with the way the other service types describe themselves. Comments?
I think that the CEA schema have been updated on this server - the latest are on http://www.ivoa.net/twiki/bin/view/IVOA/RegDMApplications - in particular within the http://www.ivoa.net/internal/IVOA/RegDMApplications/app_v1b2_schema.zip attachment. These schema lag the core 1.0 schema a little in their official development status (i.e. they are still subject to some possibility of change, though it would be good to have this finalized in Moscow...) but they do follow the new service registration style.
There was a reason for putting the managedApplications in the interface: it would be possible to have a SIAP service with a CEA interface, for instance. That is, the Capability element is supposed to give metadata about the service itself (e.g. sky coverage in the case of SIAP) and the interface element is supposed to give metadata about how to call the service. In the case of a pure CEA server there is really little distinction between the two, though, and I can see how you would feel that the managed applications should be in the Capability.
-- PaulHarrison - 21 Aug 2006
The distinction between service metadata in the capability and how to call the service in the interface isn't reflected in the examples for Cone and Siap. Coverage is placed outside the capability element. Metadata such as 'max search radius', 'max file size' and 'whether verbosity parameter is accepted' are in the capability, not the interface.
It seems to me that placing managedApplications in the capability makes CEA more uniform with all the other schema types.
Your example of a SIAP service with a CEA interface is interesting, but seems quite contrived. Embedding a 'cea' interface within a 'siap' capability is just going to make things harder to find. Notice that capability has a standardID attribute - this could either be the standard for 'siap' or for 'cea'. Not both.
I thought it more likely that a registration for a cea server would have a capability that lists the managedApplications, and then one or more interface elements, which list different versions of the CEA protocol - each of which provides access to the managed applications.
-- NoelWinstanley - 22 Aug 2006
It is good that this debate is happening now, as the proposed schema have sat uncommented upon on the IVOA site for quite a while - perhaps the debate should be moved onto the IVOA twiki, and brought up at the forthcoming interop.
I think that the example of a SIAP service with a CEA interface is not contrived, but is actually the central use case for the Capability Interface model. Look at section 5 of the Registry Metadata document, quoting: "Capability metadata, which describe what the service does, its limitations, and other behavioral characteristics" and "Interface metadata, which describe how to access the service - the inputs and the outputs". My contention is that CEA is really just an interface definition: you can tell me what a SIAP server does, but you can only tell me how a CEA server does stuff.
Incidentally - I think that the SIAP description in the test registry is old - compare with the latest example from the IVOA twiki http://www.ivoa.net/internal/IVOA/RegUpgradeSummer2006/sia.xml and see the text in http://www.ivoa.net/Documents/WD/ReR/VOResource-20060530.html . But you are right that coverage has been placed outside even capability, as a new overall CatalogService Resource level type seems to have been created. I think that there was a 'historical reason' for doing this, but I have forgotten exactly what - e.g. see the end of http://www.ivoa.net/forum/registry/0606/1666.htm
-- PaulHarrison - 22 Aug 2006
Okay, I have now put the sample up and it went into the test registry at: SIA_new
The older one did not validate: something about an invalid QName around the xlink:href or coord_system_id, I can't remember which. But it seems he fixed the xml and it went through now.
-- KevinBenson
Registry Explorer
The Registry Explorer UI in the workbench now uses the new reg client. The speedup is noticeable.
In particular, it displays the first results while the later results are still being parsed - and doesn't crash from exhausted memory. From the 150806 snapshot at AstroRuntime, try entering 'galaxy' or 'abell' into the registry explorer (not app launcher - which only searches a subset of the whole registry). You'll see there's an initial wait while the query document is constructed on the server, but after that the results are displayed incrementally in the registry browser.
Streaming Registry Service
So, the most noticeable delay in querying at the moment is on the server side. The current registry implementation builds a DOM of the results before returning them, over SOAP, to the client. This buffering of the results slows down the process, and also makes it likely that the registry server will fail with an out-of-memory error on large queries.
I think it'd be sensible to use XFire to re-implement the registry query service in a streaming manner - then there'd be no buffering of results in a DOM - instead the resultset from eXist could be iterated through and written straight out. This should improve the speed and reliability of our registry implementation. I think Matthew Graham is looking at using XFire within Carnivore - as he posts to the XFire users mailing list occasionally.
Yes, as of last week I have started making the changes to make it streaming on the server side by using XFire. The code so far seems easy to implement, though several things change in going from processing the DOM to streaming. For the way the current Registry works, here is the first paragraph from an e-mail I sent:
The current Registry gets a ResourceSet, which is a collection of your results (but the XML isn't actually loaded into memory until you call getResource(int) on the collection). Since it does not currently stream, it takes this ResourceSet and calls the getMembersAsResource method, which essentially makes one big XML DOM from the whole collection and returns it, which as you note makes things quite slow. The XFire implementation I am working on will take the ResourceSet and go through the collection, piping it to the stream connected to an XMLStreamReader, which should give a significant improvement in response time.
By the way any idea how to actually see/confirm it is streaming back?
-- KevinBenson
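A rough sketch of the server-side idea discussed above, and of one way to check that output really is incremental. It uses only the JDK, not XFire or the eXist API: the Iterator of serialized records is a hypothetical stand-in for walking the ResourceSet, and all class and method names are made up for the example. The check simply logs the order of fetches and flushes. A streaming writer alternates between them, whereas a buffering one would do all its fetches first.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: writes each record as soon as it is fetched. */
final class StreamingResultWriterSketch {

    /** Pulls records one at a time and writes them inside a wrapper element. */
    static void writeResults(Iterator<String> records, Writer out) throws IOException {
        out.write("<results>");
        while (records.hasNext()) {
            out.write(records.next()); // one serialized record (hypothetical source)
            out.flush();               // push it out before fetching the next one
        }
        out.write("</results>");
        out.flush();
    }

    /** Returns the order in which records were fetched and output was flushed. */
    static List<String> traceInterleaving(List<String> records) {
        final List<String> log = new ArrayList<>();
        final Iterator<String> source = records.iterator();
        Iterator<String> logged = new Iterator<String>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public String next() {
                String record = source.next();
                log.add("fetch");
                return record;
            }
        };
        Writer sink = new StringWriter() {
            @Override public void flush() { log.add("flush"); }
        };
        try {
            writeResults(logged, sink);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(traceInterleaving(Arrays.asList("<Resource/>", "<Resource/>")));
    }
}
```

For two records the trace is fetch, flush, fetch, flush, flush. Against a real service, a comparable rough check would be whether the first bytes reach the client well before the last record has been read on the server.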
Caching
To further speed up the registry client, I'm using a caching library called ehcache ( http://ehcache.sourceforge.net/ ) to cache previous results. It's quite simple to use: the api looks similar to a conventional Map. Ehcache can be configured to use either/or/both of a memory cache and a disk cache. The caches can be LRU, or FIFO, etc. It's also possible to set expiry times for the cached results.
Within the registry client, the cache keys are the XQuery string, and the results are the array of Resource object. I use a small memory cache backed by a larger disk cache, and the cache entries are set to expire after 24 hours. Once AG becomes more stable, I may make the expiration period longer (a week even?). I've added an operation to the 'Edit' menu to flush the cache.
This cache gives a significant speedup when the same query is done repeatedly (try querying for 'galaxy' again) - as serialized results are read from local disk, rather than running a new query on the registry service. Another place where this shows is in AstroScope & Helioscope. The lengthy query that lists all Cone services, for example, is only passed to the Registry service on the first run. All other runs that day use the cached copy. This is faster for the user, and reduces the load on the registry service.
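To make the caching policy above concrete, here is a minimal in-memory sketch of a cache keyed on the XQuery string, with LRU eviction and an expiry time. It deliberately does not use the ehcache API: it is a plain LinkedHashMap, it has no disk store, and the class and method names are invented for the example. The current time is passed in explicitly so the expiry behaviour is easy to see.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Illustrative only: small LRU memory cache with a time-to-live. Not the ehcache API. */
final class QueryCacheSketch<V> {

    /** A cached value together with the time at which it stops being valid. */
    private static final class Cached<T> {
        final T value;
        final long expiresAt;
        Cached(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<String, Cached<V>> entries;

    QueryCacheSketch(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry first in iteration order.
        this.entries = new LinkedHashMap<String, Cached<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Cached<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Stores the results of a query, replacing any earlier entry for the same XQuery. */
    void put(String xquery, V results, long nowMillis) {
        entries.put(xquery, new Cached<V>(results, nowMillis + ttlMillis));
    }

    /** Returns the cached results, or null if absent or expired. */
    V get(String xquery, long nowMillis) {
        Cached<V> hit = entries.get(xquery);
        if (hit == null) {
            return null;
        }
        if (nowMillis >= hit.expiresAt) {
            entries.remove(xquery);
            return null;
        }
        return hit.value;
    }

    /** Empties the cache, like the flush operation on the 'Edit' menu. */
    void flush() {
        entries.clear();
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        QueryCacheSketch<String[]> cache =
                new QueryCacheSketch<String[]>(100, TimeUnit.HOURS.toMillis(24));
        long now = System.currentTimeMillis();
        cache.put("//vor:Resource[...]", new String[] {"ivo://example/one"}, now);
        System.out.println(cache.get("//vor:Resource[...]", now) != null);
    }
}
```

A production cache would also need thread safety and an overflow-to-disk tier, which is what a library like ehcache provides.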
I think it'd be sensible to use ehcache to cache results on the Registry server too. Reading a serialized result from the cache should be faster than running the query in the eXist xml database. Just a small cache would speed things up by ensuring the answers to the most common queries (e.g. resolving endpoints for communities, listing all cone services, etc) were at hand.
Querying.
The new registry client accepts both ADQL and XQuery queries - but I'm downplaying the ADQL interface - as the abilities for ADQL to query a hierarchical document collection are limited, and also because AstroGrid lacks a working ADQL/s->ADQL/x parser (e.g. it won't accept 'NOT', even though that is part of ADQL). Furthermore, the future of ADQL is uncertain at the moment.
The workbench now uses XQuery throughout to query the registry. It's quite hard to develop these queries at the moment (mostly because I'm still learning). It'd help to know some more about how xquery works in Exist.
What version of eXist are we running, and where is this information available from?
For development, is there some way of running a query directly on the Galahad Exist? - e.g. some kind of XQuery dialog / console?
Are xqueries passed straight through to the exist database, or does the registry client manipulate / sanitize / transform them in any way?
Answers to above questions:
eXist on galahad is quite old, and Catherine and I really need to update it, especially since there is now a finalized version of eXist. In the stress and scalability tests I have recently been doing on the new eXist, I could not hit my target of 2 seconds per query (substring/keyword queries), but I could reach 2.5 to a maximum of 4 seconds (this is just to the db). I should note that we currently do "for $x ..." type xqueries; those did not perform well, and it was much better to do //Resource[{query}].
Our root is actually /Astrogrid/Resource, or for the 0.10 case it would be /Astrogrid/vor:Resource. The RootResource syntax is probably only on the test registry, not galahad: there you pass in RootResource and it is translated to the correct Resource on the backend.
Straight XQueries via the service are for the most part passed straight through, except for one thing: it wraps an element around the results. This is because it does all the SOAP itself.
If you were at galahad there are some ways of querying eXist directly, and if you could log in to galahad there is a command-line console. There is also an XML-RPC client, but it would be hidden if you're not at Leicester.
-- KevinBenson
Indexing
I found that some xqueries ran very slowly, while equivalent queries expressed in different ways ran significantly faster. What indexes, if any, are there on the Galahad eXist?
For example, doing a string match search using the standard XQuery 'contains' proved to be very slow. Using the '&=' operator is much faster. However, '&=' is an eXist extension - not part of the XQuery language. I think it's important to keep the workbench free of implementation-specific details - so that someone else could connect it to a different registry implementation, etc. Would indexes, or different ways of arranging the resources, speed this up?
Info to above statement
eXist automatically indexes the xml, but we should add other indexes to help certain queries (namely what are called range indexes). A while back I got Matthew Graham's indexes; there are not many of them, but they could help in certain queries. I have not tried it yet, but there is an xquery-standard match-any() and a match-all() that we should switch to instead of &=. I suspect they will be quick like &=, but I think you lose the ability to use expressions, though I don't think we used that anyway.
-- KevinBenson
Collection Root
It's not documented how to refer to the root of the registry document collection. I've tried /vor:Resource which doesn't work. The new Registry Standard says that the collection root is called RootResource. I'm unable to make that work either.
At the moment, all the xqueries start with //vor:Resource (which means 'match a Resource element, anywhere, at any depth in the hierarchy'). This is very inefficient - as it needs to search throughout every registry record for a match. What is the collection root?
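Kevin's answer in the Querying section gives the root as /Astrogrid/Resource. The sketch below uses the JDK's XPath 1.0 engine over a tiny made-up document to show how the rooted and descendant forms relate. It is not eXist and says nothing about eXist's performance or indexes. The document, the class name and the omission of the vor: namespace are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: compares rooted and descendant paths on a toy document. */
final class CollectionRootSketch {

    /** Hypothetical collection: an Astrogrid wrapper holding two Resource records. */
    static final String DOC = "<Astrogrid>"
            + "<Resource><title>Abell clusters</title></Resource>"
            + "<Resource><title>Galaxy survey</title></Resource>"
            + "</Astrogrid>";

    /** Counts the nodes selected by an XPath expression over DOC. */
    static int count(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(DOC)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(expression, e);
        }
    }

    public static void main(String[] args) {
        // Descendant search: has to consider every element at every depth.
        System.out.println(count("//Resource[contains(title, 'Galaxy')]"));
        // Rooted path: names the wrapper explicitly and selects the same record.
        System.out.println(count("/Astrogrid/Resource[contains(title, 'Galaxy')]"));
        // Without the wrapper element the path selects nothing.
        System.out.println(count("/Resource"));
    }
}
```

If the wrapper element really is Astrogrid, that would be one possible explanation for why /vor:Resource on its own matched nothing.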
Summary
I think we'd have a top-class registry service if we could add (in order of difficulty):
- Documentation on how to write efficient queries
- Indexing
- Caching
- Streaming
Other Questions
Has an endpoint been published yet for the Registry Of Registries? Or is there a temporary version somewhere I can take a look at? In particular, I'd like to know whether the query interface for a registry is listed in the RoR, or just the harvesting interface.
I only know of the temporary one at: http://nvo.ncsa.uiuc.edu/cgi-bin/rofr/oai.pl
This is only meant to have an OAI interface, and I need to tell Ray to put my test reg in it. As you can see from http://nvo.ncsa.uiuc.edu/cgi-bin/rofr/oai.pl?verb=ListRecords&metadataPrefix=ivo_vor , there is only its definition of itself.
-- KevinBenson
Schema evolution
I'm concerned with the evolution of the schemata in respect of CEA.
First, I agree with Noel's point above concerning CEAService v1.0. We should have a CEA capability instead of mutating the interface element.
Second, assuming that we want to end up with CEAService 1.0, with new structure to exploit capability, we should be careful about how we get there. If we make it one big change then lots of AG software has to change to suit. If we have an intermediate step, CEAService 0.3 say, which is the existing record type refactored onto VOResource 1.0, then we have an easier transition. It should be easier (trivial?) to deconstruct this in the AR, because it's the same structure in a new namespace. If it happens that all consumers of VOService just look for the ManagedApplications element and ignore the container then this intermediate form does not help as much.
Third, as ever, I urge everybody not to issue successive versions of a schema without changing the namespace URI.
Fourth, if I understand correctly, CEAService is still internal to AG, even if its namespace URI starts with http://ivoa.net/. Therefore, we can have as many intermediate versions as we need to get it right before beatification in the IVOA. It's not subject to Roy's "one more schema change ever" fiat. What we have isn't very 1.0 yet.
-- GuyRixon - 22 Aug 2006
The evolution of the CEA schema has been made an IVOA-level issue - it is on the Registry WG Roadmap http://www.ivoa.net/twiki/bin/view/IVOA/InterOpMay2006ResReg#Session_6_Roadmap - however, I agree that the only sensible approach for AG to take is to develop a CEAService 0.3 that has a smaller number of changes to bridge the gap.
-- PaulHarrison - 22 Aug 2006