A development of ideas about the VO Query Language. I'll post this properly after feedback.
Introduction
A reasonable amount of thought is going into what form a VO Query Language (VOQL) might take (see the discussions on the IVOA VOQL list and some on the Registry list). My own feeling is that we might be trying too hard to shoehorn too many types of query into a single language.
Why a VOQL?
AstroGrid proposed some form of Astronomical Query Language (AQL) from the early days of the project. The requirements were never formalised, leading to a situation where the number of views was equal to or greater than the number of participants in the discussion.
In general, though, the AQL was seen as one of two types:
- a SQL-like language with additional astronomical functions
- a completely new language requiring syntactic and semantic definition
Following on from the latter, we began investigating the feasibility of ontology-based services and developed some early prototypes, but the field is complex and the project could not afford the time needed. It was decided that the AQL would be:
- xml-based, since the primary function was to transmit the query from user interface to dataset
- based on UCDs as column names, to allow for the wide variation in column naming between datasets
- probably SQL-based for ease of translation into actual SQL for rdbms-based datasets
When the IVOA decided to set standards for interoperability, a common query language that could be sent from an originating component (whether a user-facing client or portal or some intermediate component) to a dataset was one of the top priorities: VOQL.
And now?
A couple of alternatives have been proposed so far (as at 04-Mar-2003):
so, covering the two types identified in AstroGrid.
This indicates that people are approaching the definition of the VOQL from two directions: top-down, asking how any sort of question that an astronomer might wish to pose could be encoded and sent to the VO black box for interpretation and answering; and bottom-up, trying to generalise the types of SQL-based queries currently sent to datasets.
Both approaches are valid but, in order to sort out what we really need, I decided to look at the whole process from the astronomer stating a problem, through to the results being returned.
VO activity
This Activity Diagram (one of the UML diagram types) is a simplified take on how a user might compose and submit a query to a future ideal Virtual Observatory.
I've listed nine players in the VO activity associated with a user stating a problem and getting the results:
- User
- Problem Assistant
- Ontology
- Workflow
- Registry
- Job Control
- Data Centre
- Data Source
- Translator
Only the first is a human (in this ultimate VO scenario); the rest are web or grid services.
User
The user has a problem which has to be stated in terms the VO can interpret and resolve. Interaction with the Problem Assistant will continue until the user is satisfied that the PA has captured the problem correctly.
Problem Assistant
The PA service will lead the user to state the problem as explicitly as possible. It will retrieve terms used by the user from the Ontology Service and attempt to rephrase the problem in unambiguous language. This clear statement of the problem will be passed to the Workflow Service. (Note: it might be useful to store the user's original request and the steps used to remove the ambiguities, eg for debugging if the results are not right.)
Ontology
The AstroOntology will contain terms and relationships that have been derived from existing astronomical sources (papers, databases, experts, etc). It will also contain metadata about past problems and results. The Ontology Service will provide an interface to the ontology that will enable ambiguities in a statement to be discovered and resolved.
Workflow
The Workflow Service will take in the problem statement (PS) from the PA and create a workflow. Jobs will be created to retrieve data, merge it, analyse it, reduce it, requery, re-analyse etc until the whole of the PS has been satisfied. The Registry will be used to match terms in the PS or interpreted from the PS to appropriate services (data retrieval, analysis, etc.), or even stored workflows. Where appropriate, checkpoints can be inserted for the user to validate intermediate results and restart the job. Where a service is specified, it will likely be done in generic terms, unless only one specific service is wanted by the user. The workflow is then passed to the Job Control Service.
Registry
The Registry will contain details (metadata) of all services offered on the VO. These might be data retrieval services (from a single or multiple data sources), data merge services, data mining services etc; they might be analysis services for reducing or transforming data; or they might be services which store temporary results, notify the user etc. The Registry Service will provide an interface to the registry for querying what services it lists.
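The lookup described above might be sketched roughly as follows. This is an illustration only: the class names, the record fields (`kind`, `keywords`) and the service names are all hypothetical, and a real registry would sit behind a web or grid service interface rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Metadata for one service listed in the registry (fields are illustrative)."""
    name: str
    kind: str  # e.g. "data-retrieval", "analysis", "storage"
    keywords: set = field(default_factory=set)


class Registry:
    """In-memory stand-in for the Registry Service interface (hypothetical)."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def find(self, kind, keywords=()):
        """Return names of services of the given kind that carry every keyword."""
        wanted = set(keywords)
        return [r.name for r in self._records
                if r.kind == kind and wanted <= r.keywords]


registry = Registry()
registry.register(ServiceRecord("survey-a-query", "data-retrieval", {"optical", "catalogue"}))
registry.register(ServiceRecord("survey-b-query", "data-retrieval", {"infrared", "catalogue"}))
registry.register(ServiceRecord("cross-match", "analysis", {"catalogue"}))

# A generic need ("any catalogue retrieval service") resolves to specific services.
print(registry.find("data-retrieval", ["catalogue"]))
```

Whether such a lookup is exposed as a handful of fixed calls or as a general query language is exactly the open question for the RQL.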
Job Control
The Job Control Service (JCS) will take in a workflow and realise it. It will interpret the workflow it receives and determine which jobs can be run and in which order, whether in parallel or in serial; it will move the output from one job to where it is needed as input to the next job; it will monitor the progress of each job and update the log. When a specific job is ready to run (either because it is first in a workflow or because the job on which it depends has completed), the JCS will use the Registry to identify those specific services which satisfy the generic needs in the workflow. It will initiate those services with whatever parameters are necessary (eg to point the service at the location of its input).
In the case of a service which retrieves data, this may mean using the owner of the data service, a Data Centre, to create an instance of the Data Source Service and pass back a handle to it. The JCS will then submit the query to the data source instance.
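One way to picture the ordering step is the sketch below, which groups jobs into batches so that every job in a batch has had all its dependencies completed and can therefore run in parallel with the others in that batch. The workflow representation (a mapping from job name to the jobs it depends on) and the job names are assumptions made for the example, not a proposed format.

```python
def schedule(dependencies):
    """Group jobs into batches; jobs within one batch can run in parallel.

    `dependencies` maps each job name to the set of jobs it depends on
    (a hypothetical representation chosen for this sketch).
    """
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    batches = []
    done = set()
    while remaining:
        # A job is ready once every job it depends on has completed.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("workflow contains a dependency cycle or an unknown job")
        batches.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return batches


workflow = {
    "query_optical": set(),
    "query_infrared": set(),
    "merge": {"query_optical", "query_infrared"},
    "analyse": {"merge"},
}
print(schedule(workflow))
# [['query_infrared', 'query_optical'], ['merge'], ['analyse']]
```

Here the two queries run in parallel, and the merge and analysis run serially after them. Monitoring, logging and moving outputs between jobs are left out of the sketch.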
Data Centre
The Data Centre Service acts as an interface to all the Data Sources within its remit. (This service may not exist where the data sources themselves are independently identified in the registry and have the means to be initiated.)
Data Source
The Data Source Service will take in the query from the JCS and use the Translator to turn that into a form that the data storage software (eg an rdbms or software fronting a system of flat files, etc) can understand. It will then execute the query.
Translator
The Translator will be a module which translates a query from the general data query language into the language of a specific piece of software, eg MS SQL Server, IBM DB2, Sybase or a system which accesses data from FITS files. A data centre need only implement the translator modules relevant to its data sources. If it transfers data from one storage type into another, it need only implement the appropriate module: the data source can continue to receive generic queries as before.
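The pluggable-module idea can be sketched as below. The `GenericQuery` structure, the backend keys and the table and column names are hypothetical; the only thing taken from real products is the row-limit syntax, where SQL Server uses `TOP n` and DB2 uses `FETCH FIRST n ROWS ONLY`.

```python
from dataclasses import dataclass


@dataclass
class GenericQuery:
    """A tiny stand-in for a query in the general data query language."""
    columns: list
    table: str
    max_rows: int


def to_sql_server(q):
    # SQL Server limits rows with TOP in the select list.
    return f"SELECT TOP {q.max_rows} {', '.join(q.columns)} FROM {q.table}"


def to_db2(q):
    # DB2 limits rows with a FETCH FIRST clause at the end of the statement.
    return f"SELECT {', '.join(q.columns)} FROM {q.table} FETCH FIRST {q.max_rows} ROWS ONLY"


# A data centre installs only the modules for the storage systems it runs.
TRANSLATORS = {"mssql": to_sql_server, "db2": to_db2}


def translate(query, backend):
    """Turn a generic query into the dialect of the named backend."""
    module = TRANSLATORS.get(backend)
    if module is None:
        raise ValueError(f"no translator module installed for {backend!r}")
    return module(query)


q = GenericQuery(columns=["ra_deg", "dec_deg"], table="sources", max_rows=10)
print(translate(q, "mssql"))  # SELECT TOP 10 ra_deg, dec_deg FROM sources
print(translate(q, "db2"))    # SELECT ra_deg, dec_deg FROM sources FETCH FIRST 10 ROWS ONLY
```

Moving data from one storage type to another then only means adding an entry to `TRANSLATORS`; the generic query arriving at the data source is unchanged.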
Languages
I believe that communication between the different players (services) above requires different languages to be specified. It does not make sense for a single language to be used to cover widely different requirements. The languages I think we will need in the future are:
- Problem Statement Language
- Workflow Language
- Astronomical dataset Query Language
- Ontology Query Language
- Registry Query Language
Whether these should rightly be called languages or simply interchange formats, I neither know nor mind.
Problem Statement Language (PSL)
This format will include terms from the ontology which have been disambiguated (probably not a real word but it sounds great). It should be human readable (or, if in xml format as Ed has proposed, easily translated into a human readable format).
Workflow Language (WFL)
This will contain the jobs to be run and the order in which to run them, probably based on some network graphing language and/or the commercial BPEL4WS standard. Within a data query job, the query will be in AQL format.
Astronomical dataset Query Language (AQL)
This is likely to be some SQL-based language, as with the JVOQL above. It will simply state the data to be extracted from a dataset and the format in which to return it. It is likely that the columns will be specified using UCD format.
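To illustrate why UCDs are attractive as column identifiers, the sketch below resolves the same UCD-based request against two datasets that name their columns differently. The dataset names, column names and lookup table are invented for the example, and the UCD strings are used purely as illustrative labels.

```python
# Hypothetical per-dataset metadata: UCD -> that dataset's own column name.
DATASET_COLUMNS = {
    "survey_a": {"POS_EQ_RA_MAIN": "ra", "POS_EQ_DEC_MAIN": "decl", "PHOT_JHN_V": "vmag"},
    "survey_b": {"POS_EQ_RA_MAIN": "RAJ2000", "POS_EQ_DEC_MAIN": "DEJ2000"},
}


def resolve_columns(ucds, dataset):
    """Map requested UCDs to the dataset's column names; report any it lacks."""
    mapping = DATASET_COLUMNS[dataset]
    missing = [u for u in ucds if u not in mapping]
    if missing:
        raise LookupError(f"{dataset} has no column for: {', '.join(missing)}")
    return [mapping[u] for u in ucds]


request = ["POS_EQ_RA_MAIN", "POS_EQ_DEC_MAIN"]
print(resolve_columns(request, "survey_a"))  # ['ra', 'decl']
print(resolve_columns(request, "survey_b"))  # ['RAJ2000', 'DEJ2000']
```

The query itself never mentions `ra` or `RAJ2000`; each data source substitutes its own names before the query is translated into its native form.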
Ontology Query Language (OQL)
This is the least likely language. Interaction with the ontology is more likely to be by means of specific calls to the service rather than some generalised query, but who knows.
Registry Query Language (RQL)
This is the means by which the metadata of services listed in the registry is queried. Again, this may simply be a number of calls, as in the OAI protocol, or it might be a fully generic language - it will depend on the outcome of the standards discussion in the Registry mailing list and workgroup.
Conclusion
I've divided my conclusion into two parts: what we should concentrate on in the near future (within the next 1-2 years), and what we should start thinking about for the more distant future (components for delivery in 2-5 years).
Near future
Looking at the components listed above, we are a long way from having any idea of how to build either the Problem Assistant or the Ontology (we could find out how to build this but have little idea of how to use it once we had it). Here I do not mean we as VO builders but we in terms of anyone working in a similar space. There is a lot of work going on in the fields of problem definition and ontology (creation and use) but there is little that could be put to practical use in the next couple of years.
The first component that we could conceivably build is the Workflow. It makes sense therefore to concentrate our efforts on languages that will be needed in the near future:
- Registry Query Language (RQL)
- Workflow Language (WFL)
- Astronomical dataset Query Language (AQL)
In fact, in my opinion, we should concentrate in the VOQL group on specifying the AQL. Early prototypes which implemented that could be made available and thereby offer data services on the VO sooner rather than later. The Registry working group should concentrate on the RQL. Once these two are completed, we can look at how to specify a stream of jobs in a workflow (WFL).
Further future
To get to the stage where an astronomer can type in a more-or-less natural language query and have the VO interpret that and return the results will take much research and development work.
The best approach might be to start developing the AstroOntology itself, since the means of querying that will likely fall out of the format. The IVOA might consider an AstroOntology mailing list so that those who are initiating work in this field can at least know that they are not alone.
Even longer term, we can look at the means whereby an astronomical problem can be turned from a wholly ambiguous natural language statement into something the VO can handle.
--
TonyLinde - 04 Mar 2003