In June 2007, NormanGray and TonyLinde submitted a proposal to the JISC Capital Call 01/07 to develop SKUA (Semantic Knowledge Underpinning Astronomy). In October 2007 we were advised that the project was accepted by JISC and would be funded from Jan'08 to Jun'09 at the requested level (£310K). The project will involve both principals (Norman @ 50%, Tony @ 33.3%) and we'll be looking to hire an additional PDRA (Level 7) for the 18 months. A separate web site will be developed for this project as well as an informational page at JISC and we will post links as soon as these are available. For now, we'll post a few details from the proposal:
Executive summary
We propose the creation of a semantic infrastructure for astronomy based on the organisation of assertion services with relatively simple interfaces. Astronomy has been part of the UK's e-Science
effort since its inception, the majority of this under the
AstroGrid project. The focus of this effort,
in the UK and within projects in at least 15 other countries, is the creation of a worldwide Virtual
Observatory (VO), making astronomical data and applications easily available to astronomers
regardless of their location and affiliation. The VO will, by defining and implementing standard interfaces,
make it possible to access common resources from multiple applications. These resources
are located via a globally distributed resource registry, which has been defined and working for
over two years now.
To date, relatively little work has been done within the VO effort on semantic systems development.
An ontology of object types has been developed by VO-France and one of us (Gray) has
developed an access control system based on OWL inferencing and a mechanism for converting
the standard VO registry XML format to RDF triples. Our project will provide a semantic infrastructure
with toolkit and API which will make it possible for many more VO developers to engage
with Semantic Knowledge Organisation Systems (SKOS). The key benefit of this proposal is that
it engages with an existing vibrant development and user community, and builds upon working
infrastructure, making it possible to demonstrate and prove both concepts and tools as we develop
them. In doing so, we engage with key outcomes of the Capital Programme and its e-infrastructure
programme.
The core concept of SKUA is that of a Semantic Assertion Collection (SAC). A SAC is a service
combining an RDF triple store with an interface providing the ability to:
- store, modify and delete assertions (RDF triples);
- return the result of SPARQL queries; and
- optionally federate its queries to one or more other SACs.
This simple extension to proven tools forms the basis of an infrastructure
which supports federating tags and queries across multiple collections, covering perhaps a user’s
personal collection, that of a project they are working on, the department they belong to, and the
worldwide VO. This allows for the construction of very personalised queries.
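As a rough illustration of the SAC idea, the following Python sketch models a SAC as an in-memory set of triples with a list of partner SACs. It is not the project's implementation: the class name, method names, the `ex:tag` predicate and the example identifiers are all hypothetical, and a simple triple-pattern match stands in for a real SPARQL query.

```python
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class SAC:
    """Toy Semantic Assertion Collection: a triple set plus partner SACs."""

    def __init__(self, name: str, partners: Iterable["SAC"] = ()):
        self.name = name
        self.triples: Set[Triple] = set()
        self.partners = list(partners)  # SACs this one federates queries to

    def add(self, s: str, p: str, o: str) -> None:
        """Store one assertion."""
        self.triples.add((s, p, o))

    def delete(self, s: str, p: str, o: str) -> None:
        """Remove one assertion; a 'modify' is a delete followed by an add."""
        self.triples.discard((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Set[Triple]:
        """Return triples matching the pattern (None is a wildcard).

        The local store is searched, then each partner SAC recursively.
        """
        found = {t for t in self.triples
                 if (s is None or t[0] == s)
                 and (p is None or t[1] == p)
                 and (o is None or t[2] == o)}
        for partner in self.partners:
            found |= partner.match(s, p, o)
        return found


# Hypothetical chain: personal -> project -> worldwide VO.
vo = SAC("global-vo")
project = SAC("project", partners=[vo])
personal = SAC("personal", partners=[project])

vo.add("ivo://example/catalogue", "dc:title", "Example catalogue")
personal.add("ivo://example/catalogue", "ex:tag", "quasars")

results = personal.match(s="ivo://example/catalogue")
```

A query against `personal` sees both the user's own tag and the VO-wide title, while a query against `project` sees only the title, which is the personalisation the paragraph above describes.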
On top of this layer of capability, we will construct a few sample applications to demonstrate
some of the additional functionality that it might provide. We expect other developers to build
many more such examples. This layer and the SAC components will be packaged as a toolkit for
these developers. In addition we will take part in JISC and astronomy meetings to promote the
technology.
Introduction
The Semantic Web has, with startling speed, graduated from wild-eyed vision to deployable engineering.
The goal of letting computers ‘understand’ has solidified into established practice and
competing implementations, so that now, with the bleeding edge moving off into yet more exotic
directions, is the ideal time to bring the core technologies to practical application. Europe has a
world-leading role in the world-wide Semantic Web (SW) community, the fruit of years of heavy
EC investment in the technology. The SKUA project will embed this expertise in a UK project, thus
disseminating it from the UK to the worldwide VO community, and within the UK to the other
metadata projects supported by the JISC.
The SKUA Project (Semantic Knowledge Underpinning Astronomy) will implement a distributed
architecture of semantically aware RDF stores. This ‘semantic layer’ will support a cluster of applications
which will either directly support users in finding and recovering useful resources, or
indirectly support them by supporting user-facing applications. We describe the architecture and
an initial set of applications below. Although the system we build will be specialised to astronomy,
and proved by its interaction with, and eventual embedding within, the Virtual Observatory, the
bulk of the semantic knowledge is localised in the RDF store, with the design goal that it could be
replaced if desired by the analogous semantic knowledge of a different domain.
SKUA architecture
Project architecture
The core component is a network of Semantic Assertion Collections (SAC) providing rather generic
semantic Web Services. For performance reasons, we expect the semantic reasoning within the
SACs to be rather simple, with more elaborate reasoning either performed in the background and
separately asserted, or simply retained within value-adding clients. The optimal level of integration
with, or even replacement of, the VO registries, will become clear during the course of the project.
This structure integrates with e-Infrastructure outcomes by supporting new ways of retrieving
data, and by integrating with key initiatives in the wider research community.
We conceive the semantic layer as a directed
acyclic graph (DAG) of SACs, each of which
can store a greater or smaller number of RDF
triples and, crucially, federate queries to a configurable
list of partner stores, in such a way
that a query against one SAC is effectively made
against the RDF triples stored in that SAC and
all the SACs that it federates to (Fig. 2). Thus the
personal SAC, which may be a local desktop
service or a personal section of a remote service,
will typically store user-specific annotations
or notes, and the global SAC will store
VO-wide information such as an RDF mirror of
the VO Registry. Information is transparently
shared by being copied from a local SAC to an
appropriate one of the SACs shared within a research group, or an ad-hoc group of collaborators,
with this copy process being managed either directly by the user, using a small UI, or as part of a
separate user-facing application’s functionality.
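The effective scope of a federated query can be sketched as a graph traversal. The Python below is only an illustration under assumed names: the federation configuration is a plain dictionary from a SAC name to the SACs it federates to, and the SAC names are hypothetical. It lists every SAC a query would reach, visiting each once, and rejects a configuration that is not acyclic.

```python
from typing import Dict, List, Set


def reachable(federation: Dict[str, List[str]], start: str) -> List[str]:
    """SACs consulted by a query against `start`, depth-first, each listed once."""
    seen: List[str] = []
    in_progress: Set[str] = set()

    def visit(name: str) -> None:
        if name in in_progress:
            # We came back to a SAC that is still being expanded: not a DAG.
            raise ValueError(f"federation cycle through {name!r}")
        if name in seen:
            return  # already reached by another path
        in_progress.add(name)
        seen.append(name)
        for partner in federation.get(name, []):
            visit(partner)
        in_progress.discard(name)

    visit(start)
    return seen


# Hypothetical topology: two routes from the personal SAC to the global one.
federation = {
    "personal": ["project", "department"],
    "project": ["vo"],
    "department": ["vo"],
    "vo": [],
}
scope = reachable(federation, "personal")
```

Here the global SAC is reachable through both the project and the department SACs but appears only once in `scope`, so a real implementation following this approach would not need to query it twice.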
Each SAC has a (standard) SPARQL endpoint which will respond to queries both from clients and
from other SACs which federate to this one. Each SAC will also support a simple RESTful API for
managing its RDF data.
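To make the two interfaces concrete, here is a hedged Python sketch of what client requests might look like. The SPARQL request uses the `query` parameter of the standard SPARQL protocol's HTTP GET binding; everything else is an assumption for illustration, including the base URL, the `/sparql` and `/assertions/...` paths, and the choice of media types. The requests are only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SAC_BASE = "http://sac.example.org/personal"  # hypothetical SAC address


def sparql_request(query: str) -> Request:
    """Build (but do not send) a SPARQL query as an HTTP GET."""
    url = f"{SAC_BASE}/sparql?{urlencode({'query': query})}"
    return Request(url,
                   headers={"Accept": "application/sparql-results+json"},
                   method="GET")


def put_triples_request(collection: str, ntriples: str) -> Request:
    """Build a REST-style PUT replacing a named set of assertions.

    The /assertions/<collection> path is invented for this sketch.
    """
    return Request(f"{SAC_BASE}/assertions/{collection}",
                   data=ntriples.encode("utf-8"),
                   headers={"Content-Type": "application/n-triples"},
                   method="PUT")


def delete_triples_request(collection: str) -> Request:
    """Build a REST-style DELETE removing a named set of assertions."""
    return Request(f"{SAC_BASE}/assertions/{collection}", method="DELETE")


query_req = sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
put_req = put_triples_request(
    "bookmarks",
    '<ivo://example/catalogue> <http://example.org/tag> "quasars" .\n',
)
delete_req = delete_triples_request("bookmarks")
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a live service; mapping store, modify and delete onto PUT and DELETE is one conventional REST reading, not a design fixed by the proposal.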
A SAC must not respond to queries indiscriminately, since to do so would expose possibly private
annotations; each SAC will keep a list of those SACs to which it has permitted federation. The
topology of federations is specified exclusively by the SACs which do the federation; the permission
to query or to write to a SAC is the responsibility of the SAC being federated to. The VO is deploying
a SSO/Security infrastructure which this project would make use of. This infrastructure would
handle the authentication issues involved, but we anticipate leaving the SAC access-control as the
responsibility of the SACs themselves (either internally, or at the HTTP layer if appropriate).
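The division of responsibility described above, where the federated-to SAC decides who may query it, might look like the following Python sketch. The class, method and identity names are hypothetical, and the requester's identity is assumed to have been authenticated already by the separate SSO layer.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


class GuardedSAC:
    """Toy SAC that answers only callers on its own permitted list."""

    def __init__(self, name: str, permitted: Iterable[str] = ()):
        self.name = name
        self.permitted: Set[str] = set(permitted)  # identities allowed to query
        self.triples: Set[Triple] = set()

    def query_from(self, requester: str, predicate: str) -> Set[Triple]:
        """Return triples with the given predicate if the requester is permitted.

        `requester` is assumed to be an identity authenticated elsewhere.
        """
        if requester not in self.permitted:
            raise PermissionError(f"{requester} may not query {self.name}")
        return {t for t in self.triples if t[1] == predicate}


group = GuardedSAC("group-sac", permitted={"claire-personal-sac"})
group.triples.add(("ivo://example/tool", "ex:tag", "beta-test"))

allowed = group.query_from("claire-personal-sac", "ex:tag")
try:
    group.query_from("unknown-sac", "ex:tag")
    refused = False
except PermissionError:
    refused = True
```

A SAC that lists `group-sac` as a federation partner without appearing in `permitted` simply gets refused, so private annotations are not exposed just because another SAC chose to federate.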
We believe these three functions – querying, updating and sharing RDF information – will support
a flexible and open-ended array of user-supporting client applications, and we will validate this
assertion by developing an initial set of such applications, as described below.
The SKUA project uses standard technologies and protocols, composed in an innovative
way. The SACs will build on one of the several available triplestore implementations; they
will be queried using the W3C-standardised SPARQL query language
(http://www.w3.org/TR/rdf-sparql-query/ as at June 2007). The VO security infrastructure realises JISC investments
by building on the Shibboleth infrastructure. The simple SAC management interface will be
specific to the SACs, but there will be no requirement for this to go beyond the standard REST interaction
pattern. Our goal is to produce a simple, open-source, and easily composable Web Service,
proved by applications. This builds on the PIs’ experience with generations of application/service
deployments in the VO and other projects.
Case studies and completion scenarios
The core of our proposal is the SAC architecture described above (Sect. 3.1 of the proposal). The SAC servers will comprise
a relatively thin layer on top of currently available triplestore technology, and so we do not
expect the server implementation to be challenging.
Deployment and user buy-in will be at least as large a problem. The PIs have a long and continuing
involvement in the VO community, and so can lead this deployment and react quickly to
user requirements. However, user acceptance can be encouraged by producing exemplar applications,
which illustrate how the architecture can be used, and which are independently useful. We
describe two such applications here, which we will implement during the course of the project.
Tagging resources and sharing bookmarks
The most basic use of the SAC network, used by both of the applications below, and most immediately
usable by existing user-facing applications, is to allow users to tag and bookmark resources
on the web or within the VO (since tags and bookmarks are technically identical, and differ only
in how they are used, we will talk only of tags below), and share those tags with other users. Web-
2.0 services such as del.icio.us and Flickr have shown how very successful simple tagging can be,
both to let users re-find resources they have found useful, and to be told of resources they had not
found before. We can do better than simple tagging, however, since a tagging application can
make use of the semantic context available from the SAC to suggest and interpret tags both when
tagging and when querying. At least one existing VO application uses a private tagging framework,
demonstrating that the demand is present.
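One way semantic context could improve on plain tagging is by interpreting a query tag as also covering its narrower concepts. The Python sketch below assumes a tiny, invented vocabulary fragment held in a dictionary; in practice such relations might come from a controlled vocabulary stored in a SAC. The concept names and the `expand_tag` function are illustrative only.

```python
from typing import Dict, List

# Hypothetical vocabulary fragment: concept -> its narrower concepts.
NARROWER: Dict[str, List[str]] = {
    "active-galaxy": ["quasar", "seyfert-galaxy"],
    "quasar": ["blazar"],
}


def expand_tag(tag: str) -> List[str]:
    """Return the tag followed by every narrower concept, breadth-first."""
    ordered = [tag]
    for concept in ordered:  # the list grows while we iterate over it
        for child in NARROWER.get(concept, []):
            if child not in ordered:
                ordered.append(child)
    return ordered


expanded = expand_tag("active-galaxy")
```

A search for resources tagged `active-galaxy` could then also return resources tagged `quasar` or `blazar`, which a purely string-matching tag service would miss.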
Application: Spacebook – semantic VRE
As the name suggests, Spacebook has an interface and (liberal) sharing model styled on the very
successful social software application, Facebook. In the case of Spacebook, though, individuals
will be able to create and share queries, workflows and assertions about VO resources, in addition
to supporting a professional/social network. In this, Spacebook will be a type of Virtual
Research Environment (VRE) with additional semantic functionality. The VRE aspect will include
portlets which embed components from the
AstroGrid VO project including: query construction
and submission, workflow construction and submission, virtual storage and jobs status; all these
components are available now. Analogously with Facebook, Spacebook will have the concepts of
Person, Institute, Group and Project, with Institute membership keyed to a user’s institutional email
address. Individual users may create Groups, and Spacebook administrators may create Projects.
Scenario: Claire logs into Spacebook and sees a summary of activities in all the areas to which
she belongs including current status of long-running jobs that she has submitted. One such job
was a complex workflow which has completed. She verifies the results are valid, tags the workflow
script in order to describe it and then pushes the script into her project area [Spacebook will transfer
the script from Claire’s virtual storage area to the project’s, it will then pick up all the assertions in
Claire’s SAC associated with this workflow and push them to the project’s SAC, with her agreement;
this will also move assertions relating to the workflow’s components]. In a blog she reads about
a new paper published in her field, so she tags it for later reading [Spacebook adds assertions about
the paper (via an arXiv URL), and passes the paper to a text mining tool which parses the paper
for terms in the VO astro-ontology, her SAC and federated SACs, adding them to the assertions for
that paper – Claire can review and change them when she later reads the paper]. She then moves
into her Project area in Spacebook (where the workflow appears as a new item added). One of
her colleagues has created a new version of a data analysis tool that implements an algorithm the
project has developed. She makes this tool accessible to ‘friends’ in a Group specially created to
test the tool [Spacebook copies assertions about the tool to the Group SAC]. Finally, Claire wants
to execute a query that a colleague has placed in the Project area but over a different set of data
sources. She begins typing into a search box; as she types each term, a graphical representation of
associated terms appears with tags often cited together appearing closer. One term in the tag-cloud
catches her eye as crucial and she clicks and adds this to her list of terms. In a window separate
to the tag representation, a list of data resources appears and is refreshed as she enters each term
[as she types, Spacebook conducts searches on each term or set of terms through Claire’s personal
SAC, the project SAC and all SACs to which these are federated; data resources associated with
highly cited tags will appear on the resource list]. Claire picks the data sources she wants to use,
submits her query and heads off for a coffee.
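The "tags often cited together" display in this scenario implies some measure of tag co-occurrence. A minimal Python sketch of one such measure follows, assuming the tag sets of individual resources have already been gathered from the relevant SACs; the function names and sample tags are hypothetical, and a real tag cloud might well use a different statistic.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def cooccurrence(tag_sets: Iterable[Set[str]]) -> Counter:
    """Count how often each unordered pair of tags appears on the same resource."""
    counts: Counter = Counter()
    for tags in tag_sets:
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts


def related(term: str, counts: Counter) -> List[str]:
    """Tags seen alongside `term`, most frequent companion first."""
    scores: Counter = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [tag for tag, _ in scores.most_common()]


# Hypothetical tag sets, one per tagged resource.
tag_sets = [
    {"quasar", "x-ray"},
    {"quasar", "x-ray", "agn"},
    {"quasar", "radio"},
]
counts = cooccurrence(tag_sets)
neighbours = related("quasar", counts)
```

Tags earlier in `neighbours` would be drawn closer to the typed term in the graphical representation.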
Application: Suggestions server
A continuing problem within the VO is that of browsing or searching the existing registries for resources
of interest, since the obvious ways of doing this produce either too few, or far too many
hits. The situation is improving with the arrival of better interfaces, but the semantically rich information
available within the SAC network (the user’s local SAC plus those it delegates to) would
allow for richer query support. We have preliminary designs for a ‘suggestions server’, acting as a
web service, which would take a list of one or more resources of interest, and return other sets of
resources related to the initial ones by an open-ended set of algorithms, using semantic relationships,
connections to existing astronomical controlled vocabularies, and statistical cluster analysis,
implemented as plugins to the server.
Scenario: Jules is writing an application to help users find new VO resources. His user has
already identified a few useful resources, and Jules would like to find more similar ones. He makes
a simple query to a suggestions server, listing the known resources, and asking for ‘more like this’;
the server responds with groups of resources which are ‘like’ the initial set in various more-or-less
heuristic ways, leaving Jules to display these to the user in whatever way best fits with his UI.
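The plugin structure of such a server could be sketched as a registry of functions, each proposing resources related to a known set by its own heuristic. Everything in the Python below is hypothetical: the plugin names, the `more_like_this` function, and the small in-memory tables standing in for information that would really come from the SAC network.

```python
from typing import Callable, Dict, Set

Plugin = Callable[[Set[str]], Set[str]]
PLUGINS: Dict[str, Plugin] = {}


def plugin(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator registering a suggestion algorithm under a name."""
    def register(fn: Plugin) -> Plugin:
        PLUGINS[name] = fn
        return fn
    return register


# Invented sample data standing in for assertions held in SACs.
TAGS = {
    "ivo://example/a": {"quasar", "x-ray"},
    "ivo://example/b": {"quasar", "radio"},
    "ivo://example/c": {"galaxy"},
    "ivo://example/d": {"x-ray"},
}
PUBLISHER = {
    "ivo://example/a": "obs1",
    "ivo://example/b": "obs2",
    "ivo://example/c": "obs1",
    "ivo://example/d": "obs2",
}


@plugin("shared-tag")
def shared_tag(known: Set[str]) -> Set[str]:
    """Resources sharing at least one tag with a known resource."""
    wanted = set().union(*(TAGS.get(r, set()) for r in known))
    return {r for r, tags in TAGS.items() if tags & wanted and r not in known}


@plugin("same-publisher")
def same_publisher(known: Set[str]) -> Set[str]:
    """Resources from the same publisher as a known resource."""
    publishers = {PUBLISHER[r] for r in known if r in PUBLISHER}
    return {r for r, p in PUBLISHER.items()
            if p in publishers and r not in known}


def more_like_this(known: Set[str]) -> Dict[str, Set[str]]:
    """Run every registered plugin and return one group of suggestions per plugin."""
    return {name: fn(known) for name, fn in PLUGINS.items()}


suggestions = more_like_this({"ivo://example/a"})
```

Returning one group per plugin, rather than a single merged list, leaves a client like Jules's free to decide how to present each kind of similarity.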
Other use-cases
Using
NaCTeM tools, and other specialised text-mining tools developed with the VO, we can
conceive of one or more SAC client applications deriving information from text sources and adding
it to a personal or group SAC.
Another value-adding client application would be an access-control service, managing role and
group information asserted within, and distributed amongst, the SACs. We have outline designs
for such a service, which would build on the distributed nature of our semantic layer, but do not
intend to implement it in this project, simply using it as one of the potential use-cases to drive the
design.
Addendum on Skuas
Check out this Wikipedia article.
I don't know how long this link will be valid but there is a
short video clip here of the BBC
Nature of Britain programme showing the
Arctic Skua: best bit shows them dive-bombing Alan Titchmarsh.
Moral of the story: when choosing project names, it helps to read
all of the relevant Wikipedia entry. Skuas are scavengers (we'll take information from anywhere and process it into muscular goodness – good connotation); it turns out, however, that skuas are also short-tempered psychotic kleptoparasites who, when they've finished stealing other birds' food, will bully seagulls for fun – not such a good vibe. Hmm: can we change the project name to Helpful and Knowledgeable Fluffy Bunny Rabbit? Please?