Use Case: join tables and remember the join

Motivation

Frequently, we need to query joined tables of data. When the tables to be joined have similar coverage, then the join requires a full scan of each table and may be expensive. It makes sense to record the joined table for possible reuse; it's likely that a user will make several queries on the same joined table. However, a joined table with a large number of columns may be too large to keep in VOSpace for any great time. Therefore, it is useful to store only an index that allows the join to be recreated cheaply.

Story

User wants to query jointly two tables, both large and both with all-sky coverage. This means that tables have to be joined in a RDBMS and that there is no small area of intersection that can be isolated to reduce the size of the join.

User sets up and runs a Crossmatcher task that analyzes the join and returns a plan. This task runs as a single-step workflow, or as a direct invocation of an application service. User passes to Crossmatcher the IVOIDs of the tables to be joined and the parameters controlling the spatial matching of objects. Crossmatcher looks up the details of the tables in the registry, especially the coverage, and notices that they are large, all-sky tables. Crossmatcher then derives, and returns to User's VOSpace, three workflows as follows.

The first workflow joins the input tables in a VOStore backed by a RDBMS and partnered by an ADQL service operating on the same DB. The flow divides the area covered by the inputs into tiles on the sky and has the following steps.

  1. Query each input catalogue for each tile, deriving matching pairs of sub-catalogues.
  2. Combine each pair of sub-catalogues.
  3. Assemble the joined catalogues into one large table.

The queries on the input catalogues are 'SELECT *' queries contrained by sky coordinates. The joined catalogues, including the final, assembled catalogue, have columns repeating the primary keys in the input catalogue and a column assigning a new primary key for the join; Crossmatcher attaches special meaning to these.

The second workflow runs one query on the joined and assembled catalogue. The query drops all the columns except the three keys noted above. I.e., it throws away all the information except the minimum needed to recreate the joined table.

The third workflow reverses the effect of the second workflow. It recovers from the original input catalogues the missing columns.

User has the chance to edit these workflows before running them.

User runs the first workflow immediately; it takes some time to run. User then queries the joined table, getting quick responses.

User then finds that the full, joined table cannot be stored for very long (it is too large), so runs the second workflow to reduce the size of the table. This removes, say, 497 of 500 columns and frees all but 1% of the storage.

User later wants to use the joined table again but realizes that it cannot be stored for very long. User edits the third workflow made by Crossmatcher and reduces the number of retrived columns to, say, the 50 most useful. User runs the workflow and can retain the result for as long as needed.

Implemention notes

Crossmatcher is a CEA service, one that we don't yet have. It is not the single-table crossmatcher that we have already deployed.

The joining of tiles can be done in parallel. If a grid is available, this is the preferred way. If no grid is available, then it is better to do the tile joins sequentially (or in very small batches). Crossmatcher may need to know about the available processors.

This case needs a relational VOStore. It's not clear that table joins are useful unless the joined table can be got into a RDBMS. However, it may be possible to join all the tiles and to store the results in files, then later to move those files into a RDBMS that is not associated with a VOstore. Clearly, this second way is non-standard and the Crossmatcher task can't plan to use it.

-- GuyRixon - 13 Jun 2005

Topic revision: r1 - 2005-06-13 - 09:08:29 - GuyRixon
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki

This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback