This document deals with catalogue-originated data and the problem of recognizing equivalent physical quantities, listed in different catalogues under a variety of related (yet inconsistent) names, for the purposes of presentation, comparison or arithmetic. The motivation is the upcoming need to do data mining in a multi-dimensional space.
The problem itself is equivalent to the one faced by a researcher who wants to combine information found in two or more tables from the literature. First one has to find out which columns represent the same physical quantity despite their labels. If they do, are they expressed in the same units? If so, one can combine them; otherwise one may have to bring the data into a common unit. Once that has been achieved, it is possible to ask scientific questions about the newly federated data. The computer-aided process does not differ much from the manual one, but the volume of data to combine may be much larger, and the user may expect more than is reasonable in terms of reliability.
How we end up with a group of catalogues is not relevant to this research, but two of the main avenues are worth mentioning here:
Grouping quantities
The columns in catalogues, the main elements to federate, can be characterized by several elements, the main ones being:
| Element | Example 1 | Example 2 |
|---|---|---|
| name | Vmag | mag_V |
| units | mag | mag |
| brief description | Johnson's V magnitude | magnitude |
| UCDs | PHOT_JHN_V | PHOT_JHN_V |
| DataType | float | float |
As expected, UCDs were the most useful element for grouping columns representing the same quantity. After all, UCDs were designed and developed for that purpose: a UCD is a piece of meta-information attached to each column in a catalogue, acting as a label or flag that indicates the type of quantity the column represents.
To understand better the relation between column names and UCDs in a catalogue, it is good to know that traditionally, a column name appears only once in a catalogue, a fact particularly suitable for any DBMS. A column has one and only one UCD attached to it within a catalogue. A UCD may occur more than once in a catalogue if several columns represent the same (physical) quantity.
The first thing one can do to see if "column A" from "catalogue 1" has a counterpart in "catalogue 2" is to note the UCD attached to "column A" and look for it among the UCDs attached to columns in "catalogue 2". Any number of columns may have that UCD (from zero to several). If no column has a matching UCD, it is safe to say that there is no counterpart in "catalogue 2". If one column has a matching UCD, it is possible that the two columns are equivalent, but further analysis is needed, because in practice astronomers measure quantities in different units or use the same words to mean slightly different quantities. If more than one column matches the UCD, the same analysis is needed, and the question of which one is the real match remains open. We strongly advocate that this question be answered by the user; providing defaults or best guesses may lead users to trust the matching process beyond its abilities.
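The three cases above (no match, one match, several matches) can be sketched as follows. This is a minimal illustration, not the prototype's code: the dictionaries, the column `V2` and the function names are assumptions made up for the example.

```python
from typing import Dict, List

# Hypothetical column meta-data for one catalogue: column name -> UCD.
Catalogue = Dict[str, str]

def find_counterparts(column: str, source: Catalogue, target: Catalogue) -> List[str]:
    """Return the columns of `target` whose UCD equals that of `column` in `source`."""
    ucd = source[column]
    return [name for name, other in target.items() if other == ucd]

def describe(candidates: List[str]) -> str:
    """Classify the outcome; the final choice is always left to the user."""
    if not candidates:
        return "no counterpart"
    if len(candidates) == 1:
        return "possible counterpart, check units and definition"
    return "several candidates, user must choose"

# Made-up sample meta-data for two catalogues.
cat1 = {"Vmag": "PHOT_JHN_V", "RAJ2000": "POS_EQ_RA_MAIN"}
cat2 = {"mag_V": "PHOT_JHN_V", "V2": "PHOT_JHN_V", "Bmag": "PHOT_JHN_B"}

matches = find_counterparts("Vmag", cat1, cat2)
print(matches, "->", describe(matches))
```

Note that the sketch deliberately returns every candidate instead of picking one, in line with the recommendation that the user makes the final decision.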
Problems and ambiguities
The matching in UCD does not assure per se that two quantities are directly comparable. Some of the problems follow:
Example: PHOT_FLUX_X, the UCD representing a flux in any X-ray band, is attached to measurements in the following units: mW/m2, eV/s/cm2, uJy, ct/cm2/s, and no unit. The no-unit case comes from a few catalogues where the energy band corresponding to the measurement is given in a different column. Only mW/m2 and eV/s/cm2 are equivalent to each other, up to a scale factor (more later).
We have made a list of the UCDs we found incorrectly assigned to columns in Vizier and will pass this list to Francois Ochsenbein so that they can be corrected.
Data federation prototype
To see how serious these problems were and to find solutions we built a prototype which emulates a registry, allowing the selection of catalogues based on their metadata (http://barbara.star.le.ac.uk/datoz/mykats.html).
Once the catalogues have been selected, the process of matching UCDs is used to find out which quantities occur and with which frequency. A few issues (obvious and not so obvious) appeared:
Solving the ambiguities
The units are indeed the best indicator for picking up errors in the assignment of UCDs and for finding correspondence between quantities.
Given a list of columns representing the same quantity but expressed in different units, it is not trivial to decide which ones are equivalent. One method we used turned out to solve the problem quite easily: dimensional analysis. Each unit can be converted to a combination of SI base units (kg, m, s, A, K, etc.). Two units are then equivalent if their SI representation is the same. For the previous example of PHOT_FLUX_X, the following equivalences occur:
| Unit | Equivalent SI | Factor |
|---|---|---|
| mW/m2 | kg.s-3 | 0.001 |
| eV/s/cm2 | kg.s-3 | 1.602177 × 10^-15 |
| uJy | kg.s-2 | 10^-32 |
| ct/cm2/s | m-2.s-1 | 10000 |
This type of analysis not only tells us whether two units are equivalent, it also gives the conversion factor needed to transform one into the other, which is a vital step in transforming units for display or data mining purposes.
We performed the conversion using a utility developed by CDS which takes a unit and transforms it into its SI equivalent and a conversion factor. This utility is not yet available as a web service, so we applied it to all the units listed in Vizier's meta-data (~6000) and stored the results.
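The dimensional analysis for the PHOT_FLUX_X units can be sketched as below. This is not the CDS utility: the `UNITS` table, the tiny parser and the function names are assumptions for illustration, prefixed units are listed explicitly instead of being parsed, and counts (`ct`) are treated as dimensionless.

```python
import re
from typing import Dict, Optional, Tuple

Dim = Tuple[int, ...]  # exponents over (kg, m, s)

# Small illustrative table: symbol -> (factor to SI, dimension exponents).
UNITS: Dict[str, Tuple[float, Dim]] = {
    "mW":  (1e-3,         (1, 2, -3)),
    "eV":  (1.602177e-19, (1, 2, -2)),
    "uJy": (1e-32,        (1, 0, -2)),
    "ct":  (1.0,          (0, 0, 0)),
    "m":   (1.0,          (0, 1, 0)),
    "cm":  (1e-2,         (0, 1, 0)),
    "s":   (1.0,          (0, 0, 1)),
}

def to_si(unit: str) -> Tuple[float, Dim]:
    """Reduce a unit such as 'eV/s/cm2' to (scale factor, SI exponents)."""
    factor, dim = 1.0, (0, 0, 0)
    # The first '/'-separated term is the numerator, all others are divisors.
    for i, term in enumerate(unit.split("/")):
        sign = 1 if i == 0 else -1
        for piece in term.split("."):
            match = re.fullmatch(r"([A-Za-z]+)(-?\d+)?", piece)
            if match is None or match.group(1) not in UNITS:
                raise ValueError(f"unknown unit: {piece!r}")
            power = sign * int(match.group(2) or 1)
            f, d = UNITS[match.group(1)]
            factor *= f ** power
            dim = tuple(a + b * power for a, b in zip(dim, d))
    return factor, dim

def conversion_factor(src: str, dst: str) -> Optional[float]:
    """Factor turning values in `src` into `dst`, or None if not equivalent."""
    f_src, d_src = to_si(src)
    f_dst, d_dst = to_si(dst)
    return f_src / f_dst if d_src == d_dst else None

for u in ("mW/m2", "eV/s/cm2", "uJy", "ct/cm2/s"):
    print(u, to_si(u))
```

With this table, `conversion_factor("eV/s/cm2", "mW/m2")` returns a number, while `conversion_factor("uJy", "mW/m2")` returns `None` because the SI exponents differ.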
The prototype made use of this feature to determine if columns were scalable within a certain UCD.
Although it was only implemented in a prototype, the method is general enough to be applied widely, and it would be worth exploring its incorporation as a web service. There are still pending issues, however, in particular in cases where angular and time units are mixed, mostly for historical reasons.
Unsolved issues
One of the issues still to be solved is that of scale. Vizier uses its own way to warn the user that quantities are expressed on a log scale, but as this is not a standard, that piece of meta-data should be incorporated into the registries.
The problem of differentials is also important to tackle. Traditionally, time-related quantities are the ones most affected by it: values are given relative to some reference, which impedes direct comparison between quantities. Clearly this could be solved by having a zero-point component in the meta-data.
These two issues are relevant in building a data-model.
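As a minimal sketch of how these two pieces of meta-data might look in a data-model, the example below attaches a log-scale flag and a zero point to a column and uses them to recover plain values. The class, field names and the order of operations (undo the log first, then add the zero point) are assumptions, not part of any existing standard.

```python
from dataclasses import dataclass

@dataclass
class ColumnMeta:
    """Hypothetical extra meta-data for one column."""
    log_scale: bool = False   # True if values are stored as log10 of the quantity
    zero_point: float = 0.0   # reference value the stored numbers are relative to

def to_plain_value(value: float, meta: ColumnMeta) -> float:
    """Undo the log scale, if any, then add the zero point."""
    if meta.log_scale:
        value = 10.0 ** value
    return value + meta.zero_point

# A time column stored relative to 2450000 and a column stored in log10.
print(to_plain_value(1234.5, ColumnMeta(zero_point=2450000.0)))  # 2451234.5
print(to_plain_value(2.0, ColumnMeta(log_scale=True)))           # 100.0
```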
Angular quantities measured in units of time are a problem for dimensional analysis, and perhaps context should be added in order to resolve the ambiguities. A "second" may represent a second of time or 15 arcsec at the equator (an angle).
Non-conventional units appear as well: distances measured in units of the Jovian radius, or in units of the distance from the Sun to the center of the Galaxy. These cases could be handled by introducing a scale factor into the data-model.
There are more subtle and difficult issues to handle, which have to do with the existence of different conventions to express the same quantity, e.g. spectral index, proper motion in RA (with or without the cos(delta) correction), etc. These discrepancies are not easy to spot and in most cases cannot be solved with a simple formula. We need to find a way to make users aware of them and avoid the temptation of combining such quantities without thinking.
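The ambiguity of the unit "s" can be illustrated with a short sketch in which a context flag decides the interpretation. The `kind` flag and the function names are hypothetical; the factor of 15 follows from 24 h of right ascension covering 360 degrees, and away from the equator the angle on the sky shrinks with the cosine of the declination.

```python
import math

ARCSEC_PER_SECOND_OF_RA = 15.0  # 24 h of right ascension cover 360 degrees

def second_as_angle(seconds: float, dec_deg: float = 0.0) -> float:
    """Arcseconds on the sky spanned by `seconds` of right ascension at `dec_deg`."""
    return seconds * ARCSEC_PER_SECOND_OF_RA * math.cos(math.radians(dec_deg))

def interpret(value: float, unit: str, kind: str) -> str:
    """Resolve the unit 's' using a context flag `kind` ('time' or 'angle')."""
    if unit != "s":
        raise ValueError("this sketch only handles the ambiguous unit 's'")
    if kind == "time":
        return f"{value} s of time"
    if kind == "angle":
        return f"{second_as_angle(value)} arcsec at the equator"
    raise ValueError(f"unknown kind: {kind!r}")

print(interpret(1.0, "s", "time"))
print(interpret(1.0, "s", "angle"))
```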
Data Merging: another prototype
In order to explore the problems posed by merging VOTables we decided to experiment with combining the results of several cone searches involving one target and several catalogues. We again used Vizier for that purpose, requesting the results as a VOTable. We were limited to consulting a very small number of catalogues, or to concatenating several tables together.
A web-based prototype was built (in Perl) to interpret the VOTable, transforming the data and meta-data into a Perl structure which was temporarily stored to allow its repeated use. The objective of this prototype was to study the best way to present, as a single table, the result of several cone searches as described above. Similar quantities should be grouped in the same column and, where possible, transformed to a single unit.
The problem of merging was dealt with in the other prototype, but producing a table with real data, where the user specifies the desired output, is a problem that had not been explored. Our prototype allows the user to upload a VOTable to the server and retrieve a customized table (in HTML so far).
Several criteria were set, based on experience, regarding the kind of features a user may wish to see in such an interface. The main emphasis was on giving users the freedom to modify as many features of the table as possible. The list below represents our wish list.
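The idea of grouping similar quantities in one column and bringing them to a single unit can be sketched as follows. This is written in Python for illustration, not in the Perl of the prototype; the table layout, the catalogue and column names, and the per-column `factor` (assumed to come from a prior unit analysis) are all hypothetical.

```python
from typing import Any, Dict, List

def merge_tables(tables: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge rows from several tables into one, with one output column per UCD.

    Each table is a dict with keys 'name', 'columns' (column -> {'ucd', 'factor'})
    and 'rows' (list of dicts mapping column -> numeric value).
    """
    merged = []
    for table in tables:
        for row in table["rows"]:
            out = {"catalogue": table["name"]}
            for col, value in row.items():
                meta = table["columns"][col]
                # Columns sharing a UCD land in the same output column,
                # scaled to the common unit by the stored factor.
                out[meta["ucd"]] = value * meta["factor"]
            merged.append(out)
    return merged

# Made-up input: the same flux UCD given in mW/m2 and in eV/s/cm2.
cat_a = {"name": "cat_a",
         "columns": {"Fx": {"ucd": "PHOT_FLUX_X", "factor": 1e-3}},
         "rows": [{"Fx": 2.0}]}
cat_b = {"name": "cat_b",
         "columns": {"flux": {"ucd": "PHOT_FLUX_X", "factor": 1.602177e-15}},
         "rows": [{"flux": 1.0e12}]}

for row in merge_tables([cat_a, cat_b]):
    print(row)
```

If a table holds two columns with the same UCD, the later one silently overwrites the earlier one in this sketch; a real interface would ask the user which column to keep.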
Data Merging: steps to a real implementation
Our work has shown that it is possible to perform data federation starting from an analysis of the meta-data of the catalogues involved, which makes it possible to find the columns that can be combined and the factors to apply when unit conversion is possible and needed. Once this analysis has been done, and with strong input from the user, it is quite straightforward to produce the desired output.
Merging is possible only if the set of catalogues has a similar scope. The universe of catalogues encompasses quite different families of data/observations, some of them with empty intersection.
The prototype we built relies on the presence or availability of certain information and methods to produce data, which is available in a limited (yet growing) number of sites.
Currently, the main limitation is imposed by systems which are not able to query a large number of catalogues at the same time in cone search mode. This is solvable by creating a service which would allow the retrieval of individual tables; one table for each pair (target, catalogue). This is not the best solution in terms of resource usage, but it would work.
Underlying problems remain in terms of the degree of consistency of VOTables produced at different sites. We have already seen variations (and even inconsistencies) between the tables produced at CDS and the ones produced at XMM-VILSPA (Spain), which make the task of interpreting and integrating them more complicated.
The problem of merging a number of tables coming from a single-target cone search is then quite "simple", and if the number of tables is kept to a reasonable figure (up to a few hundred), there should be no problems with the existing computer resources (mostly memory). Until services allow a cone search over more than just a few catalogues, individual querying and retrieval is the solution. Assuming VOTables are equivalent, the number of services invoked should not present a problem either.
More challenging scenarios exist, such as multi-target, multi-catalogue cone searches, where the merging of tables originating in different services will be more complex. As this scenario is likely to be quite popular, it is definitely worth exploring in more detail.
The true merging of two or more catalogues is a very desirable goal. Column merging can be much easier to handle than deciding which sources represent the same physical object in the sky. This kind of application is relevant to joining multi-wavelength data and to studies involving the time domain: photometric variability and positional shifts of the sources. The main limitation in this case would be imposed by the size of the catalogues involved. A high-performance language should be preferred to a scripting or multi-platform language. In the case of very large catalogues (2MASS, USNOB1.0), it is unthinkable to perform the merging anywhere other than on the machines hosting these catalogues.
In summary, merging single-target multi-catalogue cone searches is achievable in the very short term; merging multi-target multi-catalogue cone searches is probably doable with the current resources on a somewhat longer timescale; true merging of two or more catalogues needs a longer time to design, develop and implement.
One important aspect to explore and implement is the use of visualization. In any of the three broad cases described above, having an idea of the location of the matched sources with respect to the target(s) can not only make it easier for the user to understand the distribution of sources from different catalogues, but can also prove invaluable in discovering systematic trends when merging two or more catalogues. Assuming that the most likely match is the closest one is meaningless if there is a systematic shift in the coordinates. One should look for accumulation point(s), which would indicate that the most likely match is found at the location of such a point (in a space defined by delta-RA, delta-Dec). Generating static figures is the first step to take. Once the problem has been understood from the astronomical point of view, interactive solutions should be developed (Java applets).
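One simple way to look for an accumulation point is to bin the offsets in the (delta-RA, delta-Dec) plane and take the most populated cell. The sketch below does that with a coarse grid; the bin size, the sample offsets and the function name are assumptions, and a real analysis would need to consider several peaks and the choice of bin size.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

def accumulation_point(
    offsets: Iterable[Tuple[float, float]], bin_size: float = 0.5
) -> Tuple[Tuple[float, float], int]:
    """Return the centre of the most populated grid cell and its count.

    `offsets` are (delta-RA, delta-Dec) pairs, e.g. in arcsec; `bin_size`
    is the cell width in the same unit. Raises IndexError on empty input.
    """
    counts = Counter(
        (math.floor(dx / bin_size), math.floor(dy / bin_size))
        for dx, dy in offsets
    )
    (ix, iy), n = counts.most_common(1)[0]
    return ((ix + 0.5) * bin_size, (iy + 0.5) * bin_size), n

# Made-up offsets: four cluster near (1.2, -0.8), two are unrelated.
offsets = [(1.1, -0.9), (1.3, -0.7), (1.2, -0.8), (1.4, -0.6),
           (-3.0, 2.0), (0.2, 4.1)]
print(accumulation_point(offsets))
```

A peak well away from (0, 0) would suggest a systematic shift between the two sets of coordinates, in which case the nearest neighbour is not necessarily the most likely match.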
Suggestions for future VO usage
A large number of catalogues do not list positions on the celestial sphere but do list names of celestial objects. These catalogues are at present invisible to, and unavailable for, cone-search type queries. Given the relevance of many of them, it is desirable that versions be created in which the object names are resolved into sky coordinates.
The problem of using different units can easily be solved a posteriori, once the data is already available, but in queries of the type:

select catalogues where POS_EQ_RA_MAIN > 4

the question is ambiguous. Does 4 represent hours, degrees or radians? These types of queries are already appearing in some services (XMM-Spain) but have limited validity if no units are specified.

Differences in scale (log or linear) can be a serious problem for a number of important variables, prompting the need for an indicator of the type of scale in which data are expressed and requested.
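The ambiguity of a unitless threshold such as `POS_EQ_RA_MAIN > 4` can be made concrete with a short sketch that converts the same number to degrees under each reading. The unit labels and the function name are assumptions for illustration, not part of any existing query language.

```python
import math

# Degrees per unit of right ascension for each possible reading of the value.
TO_DEGREES = {"deg": 1.0, "h": 15.0, "rad": 180.0 / math.pi}

def threshold_in_degrees(value: float, unit: str) -> float:
    """Convert a right ascension threshold to degrees; `unit` must be explicit."""
    if unit not in TO_DEGREES:
        raise ValueError(f"unit required, got {unit!r}")
    return value * TO_DEGREES[unit]

for unit in ("h", "deg", "rad"):
    print(f"RA > 4 {unit} means RA > {threshold_in_degrees(4, unit):.2f} deg")
```

The three readings select very different regions of the sky, which is why the unit has to travel with the query.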
Units should be incorporated into the registry query language.

-- PatricioOrtiz - 26 Sep 2003
This is the AstroGrid Development Wiki