Federating results of searches

Introduction

This document deals with catalogue-originated data and the problem of recognizing equivalent physical quantities, listed in different catalogues under a variety of related (yet inconsistent) names, so that they can be presented, compared or used in arithmetic. The underlying motivation is the upcoming desire to do data mining in a multi-dimensional space.

The problem is equivalent to the one faced by a researcher who wants to combine information found in two or more tables from the literature. First, one has to find out which columns represent the same physical quantity regardless of their labels. If they do, are they expressed in the same units? If so, one can combine them; otherwise one may have to bring the data into a common unit. Once that has been achieved, it is possible to ask scientific questions about the newly federated data. The computer-aided process does not differ much from the manual one, but the volume of data to combine may be much larger, and the user may expect more than what is reasonable in terms of reliability.

How we end up with a group of catalogues is not relevant to this research, but two of the main avenues are worth mentioning here:

  • Selection of catalogues based on their meta-data properties, for instance by applying restrictions to fields like title, author, release date, number of records, number of columns, publication-related properties, column content in the form of Unified Content Descriptors (UCDs), column names, units, or a combination of column and data content.

  • Performing a single- or multiple-target cone search on a set of catalogues (which is a form of column and data content sub-selection).

In the former case, data should be retrieved after the meta-data based catalogue selection, while in the latter, data and meta-data may be present at the same time if the transport format is VOTable.

We already had a good idea of the complexity of the problem. To understand it better, we decided to ingest the meta-data contained in one of the largest data collections in the world: the Vizier collection of about 10000 tables and approximately 124000 columns.

The results we present in this work therefore come from a combined theoretical and empirical approach.


Grouping quantities

The columns in catalogues, the main elements to federate, can be characterized by several elements, the main ones being:


Element            Example                Example
name               Vmag                   mag_V
units              mag                    mag
brief description  Johnson's V magnitude  magnitude
UCD                PHOT_JHN_V             PHOT_JHN_V
DataType           float                  float

Other important elements are scale (linear/log), statistical properties such as range (minimum, maximum), mean value, etc. All these elements are part of the meta-data of a given catalogue and can conceivably be stored in registries. The examples shown in the table above are taken from real catalogues published in the literature.

As expected, UCDs were the most useful element for grouping columns representing the same quantity. After all, UCDs were designed and developed for that purpose: a UCD is another piece of meta-information attached to each column in a catalogue, acting as a label or flag that indicates the type of quantity the column represents.

To understand better the relation between column names and UCDs in a catalogue, it is good to know that traditionally, a column name appears only once in a catalogue, a fact particularly suitable for any DBMS. A column has one and only one UCD attached to it within a catalogue. A UCD may occur more than once in a catalogue if several columns represent the same (physical) quantity.

The first thing one can do to see if "column A" from "catalogue 1" has a counterpart in "catalogue 2" is to note the UCD attached to "column A" and look for the UCDs attached to columns in "catalogue 2". Any number of columns may have that UCD (from 0 to several). If no column has a matching UCD, then it is safe to say that there is no counterpart in "catalogue 2". If one column has a matching UCD, then it is possible that the two columns are equivalent, but further analysis is needed, since in practice astronomers measure quantities in different units or use the same words to mean slightly different quantities. If more than one column matches the UCD, the same analysis needs to be done, but it leaves open the question of which one is the real match. We strongly advocate that this question be answered by the user; providing defaults or best guesses may lead users to trust the matching process beyond its abilities.
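The three outcomes described above (no match, one candidate, several candidates) can be sketched in a few lines of Python. This is only an illustration: the `Column` record, the function names and the second catalogue's column names are made up for the example, and the sketch deliberately returns every candidate instead of picking one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Minimal column meta-data (hypothetical structure for this sketch)."""
    name: str
    ucd: str
    unit: str


def find_counterparts(column, other_catalogue):
    """Return every column of other_catalogue that carries the same UCD."""
    return [c for c in other_catalogue if c.ucd == column.ucd]


def describe_match(column, other_catalogue):
    """Classify the UCD match without ever choosing on behalf of the user."""
    candidates = find_counterparts(column, other_catalogue)
    if not candidates:
        return "no counterpart", candidates
    if len(candidates) == 1:
        return "possible counterpart, check units and definition", candidates
    return "ambiguous, user must choose", candidates


# Illustrative data: "Vmax" and the layout of catalogue_2 are assumptions.
catalogue_1 = [Column("Vmag", "PHOT_JHN_V", "mag")]
catalogue_2 = [
    Column("mag_V", "PHOT_JHN_V", "mag"),
    Column("Vmax", "PHOT_JHN_V", "mag"),
    Column("RAJ2000", "POS_EQ_RA_MAIN", "deg"),
]

status, candidates = describe_match(catalogue_1[0], catalogue_2)
print(status, [c.name for c in candidates])
```

Returning the full candidate list, with no default, keeps the final decision with the user, as advocated above.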

Problems and ambiguities

A match in UCD does not by itself guarantee that two quantities are directly comparable. Some of the problems follow:

  • The main source of discrepancy is the use of different units to measure the same quantity. One could try to define a valid set of units for a given UCD, but that would rely too much on the data: errors in the assignment of UCDs do occur, "polluting" the pool of valid units.

    Example: PHOT_FLUX_X, the UCD representing a flux in any X-ray band, contains measurements in the following units: mW/m2, eV/s/cm2, uJy, ct/cm2/s, and no unit. The no-unit case comes from a few catalogues where the energy band corresponding to the measurement is shown in a different column. Only mW/m2 and eV/s/cm2 are equivalent to each other, up to a scale factor (more on this later).

    We have made a list of the UCDs we found incorrectly assigned to columns in Vizier and will pass this list to Francois Ochsenbein so that they can be corrected.

  • Other problems arise when a quantity is expressed in linear scale in one catalogue and in log scale in another. A typical example is temperature, which, depending on the author's preference, can be expressed either as T or log T (temperature is normally in Kelvin, but in one case its units are kT, which is an energy but could be misread as kiloTesla). Vizier uses a convention to denote units in logarithmic scale, which makes them easy to spot, but that convention is not in widespread use.

  • Even more dangerous is the case in which quantities are expressed relative to a reference value: the units are the same, the scale may be the same, but the zero point differs for one or more quantities. For example, TIME_DATE, the UCD representing a date, is expressed in a relatively small number of units, but one of the column descriptions reads "Observation time, (year-1900)*100+month", while others use the conventional dd/mm/YYYY or even a Julian date or modified Julian date.

  • Of particular relevance is the fact that Right Ascension and Declination may be given for different equinoxes. This is a real problem unless services can produce precessed coordinates, including equinox-dependent quantities like proper motion in RA and proper motion in Dec.
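The zero-point problem for dates can be made concrete with a small Python sketch. It decodes the "(year-1900)*100+month" encoding quoted above and converts a modified Julian date to a Julian date; the function names are invented for this illustration, and the only outside fact used is the standard offset JD = MJD + 2400000.5.

```python
MJD_OFFSET = 2400000.5  # standard definition: JD = MJD + 2400000.5


def decode_year_month(code):
    """Decode the encoding (year-1900)*100+month into (year, month)."""
    years_since_1900, month = divmod(int(code), 100)
    if not 1 <= month <= 12:
        raise ValueError(f"{code!r} does not encode a valid month")
    return 1900 + years_since_1900, month


def mjd_to_jd(mjd):
    """Shift a modified Julian date to the Julian date zero point."""
    return mjd + MJD_OFFSET


print(decode_year_month(9503))
print(mjd_to_jd(51544.5))
```

Neither value can be compared with the other, or with a dd/mm/YYYY string, until the encoding and zero point of each column are known from its meta-data.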

Data federation prototype

To see how serious these problems were and to find solutions we built a prototype which emulates a registry, allowing the selection of catalogues based on their metadata (http://barbara.star.le.ac.uk/datoz/mykats.html).

Once the catalogues have been selected, the process of matching UCDs is used to find out which quantities occur and with which frequency. A few issues (obvious and not so obvious) appeared:

  • When the number of catalogues is large, very few quantities (if any) occur in all catalogues.

  • It is possible to use all the UCDs present in the catalogues to form groups of catalogues with similar characteristics, enabling a more refined search of catalogues by the user.

  • Selecting a large number of catalogues has the advantage of showing the wide variety of ways in which a physical quantity is expressed, mostly in terms of units and scales.

  • Catalogues with similar subjects tend to group well (with several columns in common), and the effect is even more marked in catalogues which form part of a series or a mission.
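Counting how often each UCD occurs across a selection of catalogues is simple to sketch in Python. The catalogue names and their UCD lists below are invented for illustration, and the prototype itself may have computed these frequencies differently.

```python
from collections import Counter


def ucd_frequency(catalogues):
    """Count in how many catalogues each UCD appears (once per catalogue)."""
    counts = Counter()
    for ucds in catalogues.values():
        counts.update(set(ucds))  # a UCD repeated inside a catalogue counts once
    return counts


def common_ucds(catalogues):
    """Return the UCDs present in every selected catalogue."""
    freq = ucd_frequency(catalogues)
    return sorted(u for u, n in freq.items() if n == len(catalogues))


# Hypothetical selection: mapping of catalogue name to the UCDs of its columns.
selection = {
    "cat_A": ["POS_EQ_RA_MAIN", "POS_EQ_DEC_MAIN", "PHOT_JHN_V", "PHOT_JHN_V"],
    "cat_B": ["POS_EQ_RA_MAIN", "POS_EQ_DEC_MAIN", "PHOT_FLUX_X"],
    "cat_C": ["POS_EQ_RA_MAIN", "PHOT_JHN_V"],
}

print(ucd_frequency(selection).most_common())
print(common_ucds(selection))
```

Even in this tiny selection only one UCD survives in all catalogues, which is the behaviour noted in the first bullet above.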

Solving the ambiguities

Units are indeed the best indicator for picking up errors in the assignment of UCDs and for finding correspondences between quantities.

Given a list of columns representing the same quantity and expressed in different units, it is not trivial to decide which ones are equivalent. One method we used turned out to solve the problem quite easily: dimensional analysis. Each unit can be converted to a combination of SI base units (kg, m, s, A, etc.). Two units are then equivalent if their SI representation is the same. For the previous example of PHOT_FLUX_X, the following equivalences occur:

Unit       Equivalent SI   Factor
mW/m2      kg.s-3          0.001
eV/s/cm2   kg.s-3          1.602177 x 10^-15
uJy        kg.s-2          10^-32
ct/cm2/s   m-2.s-1         10000

This type of analysis not only tells us whether two units are equivalent, but also gives the conversion factor needed to transform one into the other, a vital step in transforming units for display or data mining purposes.

We performed the conversion using a utility developed by CDS, which takes a unit and transforms it into its SI equivalent and a conversion factor. This utility is not yet available as a web service, so we applied it to all the units listed in Vizier's meta-data (~ 6000) and stored the results.

The prototype used these stored results to determine whether columns within a given UCD could be scaled to a common unit.
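As a minimal sketch of how such a stored table can be used, the Python below hard-codes the four PHOT_FLUX_X entries listed above and derives a conversion factor only when the SI representations agree. The dictionary layout and function name are assumptions made for this example; this is not the CDS utility.

```python
import math

# Unit -> (SI representation, factor to SI), copied from the table above.
SI_TABLE = {
    "mW/m2": ("kg.s-3", 1e-3),
    "eV/s/cm2": ("kg.s-3", 1.602177e-15),
    "uJy": ("kg.s-2", 1e-32),
    "ct/cm2/s": ("m-2.s-1", 1e4),
}


def conversion_factor(unit_from, unit_to):
    """Return the factor f such that value_to = f * value_from.

    Returns None when the two units have different SI representations,
    i.e. when they are not equivalent and must not be merged blindly.
    """
    dim_from, factor_from = SI_TABLE[unit_from]
    dim_to, factor_to = SI_TABLE[unit_to]
    if dim_from != dim_to:
        return None
    return factor_from / factor_to


print(conversion_factor("eV/s/cm2", "mW/m2"))
print(conversion_factor("uJy", "mW/m2"))
```

The second call returns None because uJy is a flux density (per unit frequency) and not a flux, which is exactly the kind of mismatch that a UCD match alone does not reveal.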

Despite being a prototype, the method is powerful enough to be applied in any situation and it would be worth exploring its incorporation as a web service, but it must be said that there are still pending issues, in particular in cases where angular and time units are mixed, mostly due to historical reasons.

Unsolved issues

One of the issues still to be solved concerns scale. Vizier uses its own way to warn the user that quantities are expressed in log scale, but as this is not a standard, that piece of meta-data should be incorporated into the registries.

The problem of differentials is also important to tackle, particularly because quantities related to time measurement are traditionally the ones affected by it, which impedes their direct comparison. Clearly, this could be solved by having a zero-point component in the meta-data.

These two issues are relevant in building a data-model.

Angular quantities measured in units of time are a problem for dimensional analysis, and perhaps context should be added in order to resolve the ambiguities. A "second" may represent a second of time or 15 arcsec at the equator (an angle).

Non-conventional units appear as well: distances measured in units of the Jovian radius, or in units of the distance from the Sun to the center of the Galaxy. These cases could be handled by introducing a scale factor into the data-model.

There are more subtle and difficult issues to handle, which have to do with the existence of different conventions to express the same quantity, e.g. spectral index, proper motion in RA (with or without the cos(delta) correction), etc. These discrepancies are not easy to spot and in most cases cannot be solved with a simple formula. We need to find a way to make users aware of them and avoid the temptation of combining them without thinking.
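The ambiguity of the "second" can be illustrated with a short Python sketch. It uses the standard relation 24 h = 360 deg, so one second of time in Right Ascension spans 15 arcsec of angle at the equator and 15 cos(Dec) arcsec elsewhere; the function name and its `dec_deg` parameter are invented for this example.

```python
import math

ARCSEC_PER_TIME_SECOND = 15.0  # 24 h = 360 deg, hence 1 s of time = 15 arcsec


def ra_time_seconds_to_arcsec(seconds, dec_deg=0.0):
    """Angle on the sky (arcsec) spanned by an RA interval given in seconds of time.

    dec_deg is the declination of the interval; at the equator (0 deg)
    the result is exactly 15 arcsec per second of time.
    """
    return seconds * ARCSEC_PER_TIME_SECOND * math.cos(math.radians(dec_deg))


print(ra_time_seconds_to_arcsec(1.0))
print(ra_time_seconds_to_arcsec(1.0, dec_deg=60.0))
```

Dimensional analysis alone cannot make this choice: the unit string is the same, and only the context (here, that the column is a Right Ascension) says that the value is an angle.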

Data Merging: another prototype

In order to explore the problems posed by merging VOTables, we decided to experiment with combining the results of several cone searches involving one target and several catalogues. We again used Vizier for that purpose, requesting the results as a VOTable. We were limited to consulting a very small number of catalogues or to concatenating several tables together.

A web-based prototype was built (in Perl) to interpret the VOTable, transforming the data and meta-data into a Perl structure which was temporarily stored to allow its repeated use. The objective of this prototype was to study the best way to present, as a single table, the result of several cone searches as described above. Similar quantities should be grouped in the same column and, where possible, transformed to a single unit.

The problem of merging was dealt with in the other prototype, but producing a table with real data, where the user specifies what s/he wants as output, is a problem that had not been explored. Our prototype allows the user to upload a VOTable to the server and retrieve a customized table (in HTML so far).

Several criteria were set, based on experience, regarding the kind of features a user may wish to see in such an interface. The main emphasis was on giving users the freedom to modify as many features in the table as possible. The list below represents our wish list.
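The prototype was written in Perl; purely as an illustration, the same idea of turning a VOTable into an in-memory structure can be sketched in Python with the standard library. The sketch relies only on the VOTable elements TABLE, FIELD, TR and TD and on the FIELD attributes name, ucd, unit and datatype; the sample document and the shape of the returned dictionaries are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_votable(xml_text):
    """Turn each TABLE of a VOTable into a dict of column meta-data and rows."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter():
        if local(table.tag) != "TABLE":
            continue
        # FIELD elements are direct children of TABLE and carry the meta-data.
        fields = [
            {key: el.get(key, "") for key in ("name", "ucd", "unit", "datatype")}
            for el in table
            if local(el.tag) == "FIELD"
        ]
        # Each TR is one record; each TD is one cell, kept as text.
        rows = [
            [td.text or "" for td in tr if local(td.tag) == "TD"]
            for tr in table.iter()
            if local(tr.tag) == "TR"
        ]
        tables.append({"name": table.get("name", ""), "fields": fields, "rows": rows})
    return tables


# Tiny hand-written sample, not the output of any real service.
SAMPLE = """<VOTABLE version="1.0">
 <RESOURCE>
  <TABLE name="demo">
   <FIELD name="Vmag" ucd="PHOT_JHN_V" unit="mag" datatype="float"/>
   <DATA><TABLEDATA>
    <TR><TD>12.3</TD></TR>
   </TABLEDATA></DATA>
  </TABLE>
 </RESOURCE>
</VOTABLE>"""

tables = read_votable(SAMPLE)
print(tables[0]["fields"], tables[0]["rows"])
```

Keeping the meta-data next to the rows is what later allows columns from different tables to be grouped by UCD and rescaled.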

  • Decide whether the matches should be sorted by increasing distance to target or grouped by catalogue.

  • Decide which columns to display and in which order.

  • If different units are used in different catalogues, decide which unit to use and whether to perform a unit transformation.

  • If more than one column is associated to a UCD in any catalogue, decide which column(s) to use.

  • Decide whether to display data from catalogues which don't have a given quantity.

  • Decide if quantities which are not suitable for merging (according to the automatic criteria) should be displayed or even merged with others.
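Two items of the wish list above (the sort order and the optional unit transformation) can be sketched as follows in Python. The record layout, the parameter names and the mJy-to-Jy factor table are assumptions made for this illustration, not a description of the prototype's interface.

```python
def merge_matches(matches, order="distance", target_unit=None, factors=None):
    """Build the rows of a merged table from cone-search matches.

    matches: list of dicts with keys catalogue, distance_arcsec, value, unit.
    order: "distance" (increasing distance to target) or "catalogue".
    target_unit / factors: optional unit transformation chosen by the user;
    factors maps (unit_from, unit_to) to a multiplicative factor.
    """
    factors = factors or {}
    rows = []
    for match in matches:
        row = dict(match)  # copy, so the input records are left untouched
        row["mergeable"] = True
        if target_unit and row["unit"] != target_unit:
            factor = factors.get((row["unit"], target_unit))
            if factor is None:
                # No known conversion: flag the row and let the user decide.
                row["mergeable"] = False
            else:
                row["value"] *= factor
                row["unit"] = target_unit
        rows.append(row)
    if order == "distance":
        rows.sort(key=lambda r: r["distance_arcsec"])
    elif order == "catalogue":
        rows.sort(key=lambda r: (r["catalogue"], r["distance_arcsec"]))
    else:
        raise ValueError(f"unknown order: {order!r}")
    return rows


# Invented matches for a single target.
matches = [
    {"catalogue": "B", "distance_arcsec": 2.0, "value": 5.0, "unit": "mJy"},
    {"catalogue": "A", "distance_arcsec": 3.5, "value": 0.004, "unit": "Jy"},
    {"catalogue": "A", "distance_arcsec": 0.8, "value": 12.1, "unit": "mag"},
]
by_distance = merge_matches(matches, "distance", "Jy", {("mJy", "Jy"): 1e-3})
by_catalogue = merge_matches(matches, "catalogue", "Jy", {("mJy", "Jy"): 1e-3})
for row in by_distance:
    print(row)
```

Rows that cannot be converted are flagged and not dropped, which leaves the last item of the wish list (display or merge anyway) to the user.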

Data Merging: steps to a real implementation

Our work has shown that it is possible to perform data federation starting from an analysis of the meta-data of the catalogues involved, which makes it possible to identify the columns that can be combined and the factors to apply when unit conversion is possible and needed. Once this analysis has been done, and with strong input from the user, it is quite straightforward to produce the desired output.

Merging is possible only if the set of catalogues has a similar scope. The universe of catalogues encompasses quite different families of data/observations, some of them with empty intersection.

The prototype we built relies on the availability of certain information and of methods to produce data, which are available at a limited (yet growing) number of sites.

Currently, the main limitation is imposed by systems which are not able to query a large number of catalogues at the same time in cone search mode. This is solvable by creating a service which would allow the retrieval of individual tables; one table for each pair (target, catalogue). This is not the best solution in terms of resource usage, but it would work.

Underlying problems remain in terms of the degree of consistency of VOTables produced at different sites. We have already seen variations (and even inconsistencies) between the tables produced at CDS and the ones produced at XMM-VILSPA (Spain), which make the task of interpreting and integrating them more complicated.

The problem of merging a number of tables coming from a single-target cone search is then quite "simple", and if the number of tables is kept to a reasonable figure (up to a few hundred), there should be no problems with the existing computer resources (mostly memory). Until services allow a cone search on more than just a few catalogues, individual querying and retrieval is the solution. Assuming VOTables are equivalent, the number of services invoked should not present a problem either.

More challenging scenarios exist, such as multi-target, multi-catalogue cone searches, where the merging of tables originating in different services will be more complex. As this scenario is likely to be quite popular, it is definitely worth exploring in more detail.

The true merging of two or more catalogues is a very desirable goal. Column merging can be much easier to handle than deciding which sources represent the same physical object in the sky. This kind of application is relevant to joining multi-wavelength data and to studies involving the time domain: photometric variability and positional shifts of the sources. The main limitation in this case would be imposed by the size of the catalogues involved. A high-performance language should be preferred to a scripting or multi-platform language. In the case of very large catalogues (2MASS, USNOB1.0), it is unthinkable to perform the merging anywhere other than on the machines hosting these catalogues.

In summary, merging single-target multi-catalogue cone searches is achievable in the very short term; merging multi-target multi-catalogue cone searches is probably doable with the current resources, on a somewhat longer timescale than the previous case; true merging of two or more catalogues needs a longer time to design, develop and implement.

One important aspect to explore and implement is the use of visualization. In any of the three broad cases described above, having an idea of the location of the matched sources with respect to the target(s) not only makes it easier for the user to understand the distribution of sources from different catalogues, but can also prove invaluable in discovering systematic trends when merging two or more catalogues. Assuming that the most likely match is the closest one is meaningless if there is a systematic shift of coordinates. One should look for accumulation point(s), which would indicate that the most likely match is found at the location of such a point (in a space defined by delta-RA, delta-Dec). Generation of static figures is the first step to take. Once the problem has been understood from the astronomical point of view, interactive solutions should be developed (Java applets).
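One simple way to look for an accumulation point, given here only as a sketch, is to bin the (delta-RA, delta-Dec) offsets of all candidate matches and take the most populated bin. The bin size, the function name and the sample offsets are assumptions made for this Python example; a real implementation would likely need a more robust estimator.

```python
import math
from collections import Counter


def accumulation_point(offsets, bin_size):
    """Return the centre of the most populated (delta-RA, delta-Dec) bin and its count.

    offsets: iterable of (delta_ra, delta_dec) pairs, e.g. in arcsec.
    bin_size: width of the square bins, in the same unit.
    """
    counts = Counter(
        (math.floor(d_ra / bin_size), math.floor(d_dec / bin_size))
        for d_ra, d_dec in offsets
    )
    (ix, iy), n = counts.most_common(1)[0]
    centre = ((ix + 0.5) * bin_size, (iy + 0.5) * bin_size)
    return centre, n


# Invented offsets: three matches cluster away from the origin.
offsets = [(1.1, -0.4), (1.2, -0.3), (1.3, -0.2), (-2.0, 3.0), (0.1, 0.1)]
centre, n = accumulation_point(offsets, bin_size=0.5)
print(centre, n)
```

In this invented sample the closest candidate lies near the origin, yet the offsets accumulate elsewhere, which is the situation where "closest is best" would pick the wrong match.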

Suggestions for future VO usage

A large number of catalogues do not list positions on the celestial sphere but do list names of celestial objects. These catalogues are currently invisible to, and unavailable for, cone-search type queries. Given the relevance of many of them, it is desirable to create versions in which the object names are resolved into sky coordinates.

The problem of using different units can be easily solved a posteriori, once the data is already available, but in queries of the type: select catalogues where POS_EQ_RA_MAIN > 4 the question is ambiguous. Does 4 represent hours, degrees or radians? These types of queries are already appearing in some services (XMM-Spain) but have a limited validity if no units are specified.

Differences in scale (log or linear) can be a serious problem for a number of important variables, which points to the need for an indicator of the scale in which data are expressed and in which data are requested.

Units should be incorporated into the registry query language.
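The ambiguity can be made explicit with a short Python sketch that normalizes a Right Ascension threshold to degrees only when a unit is supplied. The unit labels ("deg", "h", "rad") and the function name are choices made for this illustration and do not correspond to any existing query language.

```python
import math

# Factors from each accepted unit to degrees (24 h = 360 deg).
TO_DEGREES = {"deg": 1.0, "h": 15.0, "rad": 180.0 / math.pi}


def ra_to_degrees(value, unit):
    """Convert a Right Ascension value to degrees, refusing unknown units."""
    try:
        return value * TO_DEGREES[unit]
    except KeyError:
        raise ValueError(f"unit {unit!r} is required and must be one of {sorted(TO_DEGREES)}")


# The same "4" selects three very different regions of the sky.
for unit in ("deg", "h", "rad"):
    print(unit, ra_to_degrees(4, unit))
```

A query language that carried the unit with the constant would let the service perform this normalization before comparing against the stored column.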

-- PatricioOrtiz - 26 Sep 2003

Topic revision: r2 - 2003-09-26 - 14:51:13 - NicholasWalton
 