Dataset Federation

What does Dataset Federation Mean?

At the Technical Support Panel meting on 3rd June 2003 (see TspReports03) I was asked to write a report on how to make progress in dataset federation. This will form part of the work on the astronomical data warehouse (or ADEF, Astronomical Data Exploration Facility) from 2003 July onwards. The first problem is to decide what we mean by dataset federation. I don't think that there is a widely accepted definition, except in vaguest of terms. The discussion at the TSP meeting brought out at least two different assumptions as to what the important features of federation were for our datasets:

  • Cross-matching source catalogues, i.e. joining them on source position overlap, sometimes called the fuzzy join
  • Merging the results obtained when the same query (such as a cone-search) is sent to several different source catalogues.

These are related, but rather different problems. To see the connection, consider what happens if the user uploads a list of celestial objects of interest, and wants to find identifications with known objects. This can be treated either as a series of cone searches, in which case it would be nice to have the results merged in some way, or as a spatial join of the user's list with one or more standard catalogues.

I think that both of these facilities are needed for a fully-functional virtual observatory, but unfortunately neither is easy to provide. I shall assume that to do federation properly we have to support both of them but look forward to seeing other opinions on the issue. It is clear, however, that this F-word is ambiguous, so I suggest that we avoid using it.


Cross-matching Source Catalogues

A lot has already been written on this, so here I just summarise the problem. The basic operation is finding matches, say to each source in a catalogue of X-ray sources, to those in another catalogue of sources detected in another waveband, say the optical band. The positions in each catalogue are given by a coordinate pair (RA,DEC) with an associated error-region, usually (but not always) a circle, because positions can never be measured without some uncertainty. The basic positional match requires the two error circles to overlap, which is a join operation in database terms.

From here on I shall switch to database notation: the source catalogue is a table, with one row per source, and the columns include RA, DEC, error-radius, and other parameters such as magnitudes or fluxes that merely need to be propagated to the output. In practice, especially with catalogues with large error-circles, a single row in the first table may match two or more rows in the second table: in such cases other criteria such as spectral types can be used, together with scientific judgement, to work out the most likely counterpart. But even in such cases finding the positional coincidences has to occur first, other selections can follow, and will be ignored in what follows here.

It is often scientifically important also to find rows (sources) in the first table which have no match in the second table, which in database terms is a left outer join. Another important feature that we need to support is allowing the users to join their own table of source positions with standard catalogues.

Database experts all tell us that a join is practically impossible except when the two tables are resident on discs on the same computer. These experts might all be wrong but I think we should assume for the moment that joins, including spatial joins, require the two tables to be part of the same database and resident on the same computer system. Distributed databases are a hot research topic and current implementations have many limitations: perhaps later we ought to investigate this area. At present, however, I shall assume that joins between tables cannot be done over a wide-area network. This is one of the main justifications for the data warehouse (or ADEF).

All relational database management systems (DBMS) are designed to perform joins efficiently, but only on exact matches of columns of integers or character strings. The join we need involves some messy trigonometric functions on coordinate pairs (RA,DEC) and will be done extremely slowly unless suitable indices are created, and suitable SQL used to make use of them sensibly. For a two-dimensional join we need a spatial index. Similar limitations apply to the cone-search, indeed the spatial join is just a large set of cone searches of the second table, one per row of the first table. Many (most?) current astronomical data archives get along at present without a spatial index, just using a simple B-tree index on one dimension. This is usually declination to avoid the wrap-around problem with the RA coordinate (0 degrees RA is the same as 360 degrees). This would be adequate for the smaller and even medium-sized tables, but gets increasingly inefficient as they get larger. Since some tables already contain over a billion rows (e.g. USNO-B), and we are committed to finding scalable solutions to the large datasets likely to arise from future telescopes, I think we have to seek a solution involving a genuine spatial index. For an more comprehensive account of the problems and possible solutions, see SkyIndexing and SpatialIndexing.

It is worth noting that two existing web sites support cross-matching to a limited extent. Astrobrowse is a service of HEASARC at GSFC which provides a limited number of pre-set cross-matches to be searched. The Vizier catalogue service of CDS at Strasbourg also has a number of pre-computed cross-matches. In addition it has a more flexible service as you can upload a list of your own coordinates which can be searched on multiple catalogues. The output, however, is verbose and totally unmerged, and there are limits to the number of matching rows from each table.

Spatial Indexing Options

There are several possibilities here:

  1. Use a DBMS which supports spatial indexing. The front-runners are:
    • Postgres: which includes R-tree indexing, and worked well in our initial evaluations, see DbmsEvaluations.
    • DB2: which has an optional spatial index based on a multi-dimensional grid-file.
    • MySQL, the latest version of which supports spatial indexing also using R-trees.
  2. Use a 2-d to 1-d mapping function which allows use of any DBMS supporting a B-tree index on integers (which all of them do). A mapping function to be used on a spherical surface requires a specially devised coding scheme otherwise the singularities at the poles and wrap-around at zero RA cause problems. Two suitable mapping functions devised by astronomers are:
    • HTM (Hierarchical Triangular Mesh)
    • HEALPix (Hierarchical Equal Area iso-latitude Pixelation)
  3. A third option is to create a multi-column index on a combination of RA and declination, but converting one to an integer since it won't work if both are floating-point quantities. Some DBMS can then do an efficient look-up. This is the method used by JHU for their SDSS databases, using SQL Server. The limitation is that one coordinate (they chose declination) has to be converted to an integer pointing to some set of discrete zones; this means that queries which span two or more zones become more complicated.

The main advantage of the mapping function approach is that you can use any old common-or-garden DBMS. The disadvantage is that the spatial join is quite a bit more complicated:

  1. First you have to make an additional table to hold the pixel codes for each catalogue table,
  2. Than create a B-tree index on these new tables.
  3. Then create another table for each join, which identifies the pixels involved.
  4. Then create a B-tree index on this.
  5. Finally the cross-match involves a 3-way join using the two original cataloges and the pixel-match table, assisted by the indices.
All of this is perfectly feasible on any DBMS, but is undoubtedly more complicated: SpatialIndexing has a worked example with all the gory details.

Given that there are at least three DBMS free for our use (two Open Source, the third free for academic use) which support proper spatial indexing, I propose that we do not at present pursue the mapping function approach unless there turn out to be serious performance problems with using a true spatial index.

Which DBMS?

At present there is no clear leader. Postgres R-tree indexing worked as it claimed right out of the box, and Postgres is a full-featured object-relational DBMS, but may be beaten on performance for other data mining operations by simpler DBMS such as MySQL. The spatial option in DB2 has not yet been investigated (it has only recently been released for Linux), but appears much more complex to use, with a number of tuning parameters to be explored (you have to choose the extent of the domain to be used at each level of the gridfile). This is consistent with the general reputation of "mainframe" DBMS such as DB2, which are designed for environments where a team of expert database administrators and tuners are on hand. This tunability will make it harder to evaluate. Whereas the spatial indices of Postgres and (I think) DB2 just support indexing of rectangular boxes, the indexing recently added to MySQL supports polygons of any number of sides. This power is probably not useful for our application, but may make the indices larger or slower. The spatial indexing is also a new feature and perhaps less stable than that of the others. I have been monitoring bug reports and seen few so far, but perhaps this merely reflects the few users of the feature.

Given the lack of a leading product, I think a short evaluation of spatial indexing for these three DBMS needs to be carried out before making a choice. Since we really need to choose a single DBMS for the data warehouse as opposed to having a different DBMS for each type of operation, the other relevant properties for data mining also need to be taken into account. This should a priority for the data warehouse work in Q7.

User-provided Tables

It is vital to allow users to cross-match their own source catalogues with sets of standard ones. This means providing the data warehouse with an uploading facility. Tables may be in one of a number of different formats, all of which need to be supported:

  • Fixed-column text
  • Comma-separated value (CSV) text files
  • VOTable format (rare at present, but likely to grow in importance)
  • FITS tables (ASCII or binary).
Previous evaluation of DBMS indicates that nearly all of them can only import ables in CSV format. Even here there are no standards, and each DBMS has its own rules about e.g. handling fields which contain commas, and handling null values. The correct handling of nulls is especially important in astronomy. I think we shall need a few utilities to convert all the permitted import formats (including the four above) into a CSV format which is compatible with the chosen DBMS. We then need to construct a suitable web service which allows a user's own file to be uploaded, converted to the required format, and ingested as a DBMS table. Most DBMS have heavyweight security features, and these will need to be used, or perhaps avoided, so that authorised users (or perhaps any user) can ingest a table. This may eventually require links with AstroPass or MySpace components.

Steps towards Deployment of a Cross-matching Service

  1. Evaluate spatial indexing facilities of the short-listed DBMS using substantial chunks of existing catalogues, e.g. USNO.
  2. Evaluate also their performance on other tasks related to data mining, and their ability to utilise parallel hardware such as Beowulf clusters.
  3. Install chosen DBMS on suitable server, e.g. hydra at Leicester, as demonstrator of ADEF.
  4. Acquire a suitable collection of standard public catalogues, e.g. USNO, 2MASS, SLOAN DR1, others currently on LEDAS, and maybe Vizier.
  5. Create spatial indices for all these.
  6. Provide a web service which supports the spatial join operation between these tables
  7. Develop software to do format conversions between VOTable, FITS, and other formats to the required internal CSV format.
  8. Provide a web service which allows the uploading and ingestion of users own catalogues.
  9. Use the results of the table-merge study (see next section) to allow one table to be cross-matched with a number of others.
  10. Create document which describes all these steps, so that other ADEFs can be deployed elsewhere.

Other Spatial Join Considerations

The results of a spatial join are likely to need additional columns to be computed, giving for instance the offsets in the two coordinates, or perhaps the radial distance. It may be desirable to order the results, where two or more rows in table2 match one in table1, in ascending order of distance.

In many cases, perhaps most, an approximate positional coincidence match will not uniquely identify matching sources: it may be necessary to use other criteria as well. In addition it will often be useful to know the probability of obtaining a match by chance. This could be done either by repeating the spatial join with a much larger error circle, or using a pre-computed value for the spatial density of objects in the region of interest.

How to Implement?

At present it is not clear that a join between tables is feasible over a wide-area network at least for tables of any size. Since AstroGrid is supposed to be seeking scalable solutions, I think we should start with the assumption that spatial joins, in particular, require both tables to be resident within the same DBMS, which means on the same computer. This is the basis for the Astronomical Data Mining Facility, or ADEF. We would need to provide facilities for users to do one or more of the following steps:
  1. Upload their own tables to the ADEF, or get the system to copy a standard table from some other server.
  2. Ingest the table to the DBMS.
  3. Create the appropriate index or indices.
  4. Carry out the spatial join, with the choice of leaving the results in situ for further DBMS work, or returning them to the user in some format such as VOTable or HTML.
  5. Perhaps carrying out a number of spatial joins, by doing the last step on several standard catalogues.


Merging Results from Queries of Several Tables

Several astronomical data archive sites already make it possible for the same query, typically a cone-search, to be sent to a number of different servers around the world. The results, however, are just presented as a sequence of separate tables, which in many cases requires a lot of additional work on the part of the user, and certainly hinders the use of the results in subsequent processing. The Vizier service provides perhaps the most advanced service in that one can upload a list of objects of interest, and have the Vizier system search a specified list of tables for each of the user's positions. However if the user's list has N objects and M catalogues are searched, the results appear as a set of N * M separate tables. So what are the prospects of providing something that is a bit more astronomer-friendly?

The simplest case to solve is that of the single cone-search carried out on M catalogues, which might well be distributed around the world. The results will come back (if VO protocols are fully implemented) as a set of M tables in VOTable format. The user will often (usually?) want these merged, so the set of possible identifications of each source are adjacent in the output. The more advanced case, that of performing a spatial join of one table with a set of M other tables (as outlined above) is actually no more complex in what it produces: merely a set of M tables. Each of these will have, in general, a different set of columns. Herein lies the difficulty.

Here is a (very much abbreviated) example of the output you get from Vizier to illustrate the problem:


  ROSAT All-Sky Bright Source Catalogue (1RXS) (Voges+ 1999) ()

    The 1RXS catalogue

To get all details for a row, just click on the row number in the leftmost `Full' column.
The 3 columns in color are computed by Vizier, and are not part of the original data.

Full_r_RAJ2000_DEJ20001RXSRAJ2000DEJ2000PosErrNewFlagCounte_CountHR1e_HR1HR2e_HR2
 arcmin"h:m:s""d:m:s"  degdegarcsec  ct/sct/s        
1 0.034911 21 15.40-60 37 25.51RXS J112115.4-603725170.31416-60.62375 8__.. 4.56e-01 4.66e-02 1.00 0.02 0.88 0.06
  Second ROSAT PSPC Catalog (ROSAT, 2000) (ReadMe)
    The 2RXP Catalog

Full_r_RAJ2000_DEJ2000RAJ2000DEJ2000HR1e_HR1HR2e_HR2Sm_2RXPCrateSrcFlags
 arcmin"h:m:s""d:m:s"degdeg              ct/s  
1 0.04011 21 14.9-60 37 25170.3121-60.6236 0.99 0.00 0.89 0.00S  J112114.9-603725 3.299e+00....islm.p.v.
2 2.06611 20 59.5-60 38 09170.2479-60.6358 0.98 0.01 0.98 0.01    J112059.5-603809 5.861e-02...fislmnpav.
3 2.90611 20 53.4-60 36 16170.2225-60.6044 1.01 0.02 1.02 0.01    J112053.4-603616 5.008e-02..efislm.pav.
49 7.54011 20 32.1-60 42 47170.1338-60.7131-0.42 0.35 1.17 1.21    J112032.1-604247 4.622e-03...fislmnp.v.
50 8.68811 20 05.6-60 35 48170.0233-60.5967 0.95 0.12 0.32 0.12S  J112005.6-603548 6.188e-03....islm.p.v.
  The WGACAT version of ROSAT sources (White+ 2000) (ReadMe)
    The full catalogue

Full_r_RAJ2000_DEJ2000RAJ2000DEJ2000Counte_CountHR1e_HR1HR2e_HR21WGAf_ErrorRadQflagMPEname
 "h:m:s""d:m:s"degdegct/sct/s            arcsec    
1 0.13111 21 15.4-60 37 16170.3142-60.6212 3.76000 0.01500 22.6300 0.4817 2.5580 0.0235J1121.2-6037   1312  
2 3.09611 20 54.4-60 35 39170.2267-60.5941 0.00608 0.00070 6.4000 2.1760 2.0870 0.5292J1120.9-6035   13 7  
3 3.72411 20 44.9-60 37 39170.1871-60.6276 0.00367 0.00061 4.6670 1.7140 1.5880 0.4917J1120.7-6037   13 7  

Note that the original search produced more than 50 matches, almost all of which I have removed to avoid making this document too long. It is also an especially simple example, in that I have not only limited results to one waveband (X-ray) but also to just one mission, ROSAT, and only the XRT instrument on that. Despite this, the three tables show quite a variety of column names. The UCDs are not shown in these results, but they were listed in the original query page.

Some features are obvious even in a tiny specialised sample such as this one:

  • A few columns are common to all tables, for example RAJ2000 and DEJ2000. Indeed if they were not then the join or distributed cone-search would not have been possible in the first place. But most columns only appear in one table, especially in a more general case than this.
  • Even where the names are identical, such as HR1 which appears in all three tables, it is not obvious that the contents are the same. An inspection of the UCDs (not shown in the Vizier output) shows that they all have the same value SPECT_HARDNESS-RATIO, but even that does not guarantee identity. The units could be different. Here, of course, a hardness ratio is unitless, so that does not apply, and in a genuine VOTable or FITS table the units should be available to us when attempting a merger. But even this still does not guarantee that the values are comparable, as the fluxes might have been calculated over different spectral ranges. The current UCD system is not sensitive enough to reflect that, and perhaps it is expecting too much for it to do this much work.
  • Even if the columns do have the same meaning and units (and spectral ranges etc.), can they be merged? Perhaps the user would like to know that two independent measurements of them have been obtained, as this may give an indication of the reliability of the data.
  • Some columns with different names probably correspond to the same value, for example I am fairly sure that Count in tables 1 and 3 is the same as Crate in table 2. The UCDs should confirm this, and indeed back-checking I see they all have a UCD of PHOT_COUNT-RATE_X, but as noted above this does not guarantee comparability.

Possible solutions to the merger problem

Given these problems, it is easy to see why a table merger has not been attempted previously. So far I have come up with these possible solutions, but there are likely to be more:

(1) Simplified table with links to details

The merged output could be a table containing only those columns which were common to all input tables; in most cases this would contain source names and the offsets from the original, and if this was the results of a join, the appropriate columns from the first input to the join. The remaining columns would just be links to the detailed contents, one column per input. Users would simply click on the link when they wanted to know the details. These links could be URLs, but this would require the ability to link to a specific row of each table using some syntax like http://site.domain.org/server?table=tablename&col=tabname&row=12345 This should be easy to set up, but is not something that most archives provide at present. One problem is that the typical DBMS does not support access by row-id natively; again that is easy to set up, but needs to be part of the design.

(2) Wide table with all columns

The other extreme solution is to produce a single merged table containing all the columns of the input tables. Most of the individual fields would be null, of course. It is not clear how useful such a sparse format would be, but this is actually the option chosen for the 1XMM catalogue recently released by the XMM-Newton Survey Science Centre, and which includes the results of a number of cross-matches precomputed using the facilities of CDS.

(3) Some compromise using common UCDs

Between extremes (1) and (2) perhaps we can use UCDs to identify a reasonable number of columns which are common to all or most of the input tables to produce an output table which is narrower than in case (2) but more informative than in case (1). Since the UCD system does not guarantee common units, it would be necessary here to have automatic conversion of physical units to some standard scheme. For example positional errors which variously appear as degrees, radians, arcminutes, arcseconds, and milli-arcseconds would all have to be converted to some agreed unit. Another problem is that UCDs are in some areas not specific enough, and in others too specific. This is not the fault of those applying UCDs, as it is inherently difficult. Sometimes is might be appropriate to merge results from two tables which use slightly different definitions of say, Johnson's V magnitude, sometimes it might not. Perhaps the astronomer has to be forced to choose.

(4) Produce a heterogeneous structure

Another possibility is to abandon the pretence that the results of a merger can be a simple uniform table. If the results are to be something like a VOTable then we have in principle all the flexibility of XML at our disposal: there is no reason why different rows should not contain different columns. It would be up to the user's software, such as a browser, how to represent the heterogeneity of the data structure (could be a classical table with lots of nulls, could have each field separately identified).

Discussion

These are my thoughts so far - I have set up a forum page here for comments.

Science Cases and Use-cases

I have started a separate page to list these, DataFederationCases, and all contributions are welcome.

Work Plan

The work to be carried out in Iteration 3 is described in AgCd10DataFederation

-- ClivePage - 11 Jul 2003 - a few minor revisions.


Note: an earlier version has had a mild witticism removed at the request of somone who found it "unacceptable". I think I have now removed all humour and thus made the text suitably anodyne and boring; anything humorous (or indeed interesting) that remains is unintentional. Anyone else finding objectionable remarks herein, please let me know at once and I'll exercise further censorship
Edit | Attach | Print version | History: r10 < r9 < r8 < r7 < r6 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r10 - 2003-07-11 - 15:01:49 - ClivePage
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback