Use Case: join tables and remember the join
Motivation
Frequently, we need to query joined tables of data. When the tables to be joined
have similar coverage, then the join requires a full scan of each table and may be
expensive. It makes sense to record the joined table for possible reuse; it's
likely that a user will make several queries on the same joined table.
However, a joined table with a large number of columns may be too large to keep
in VOSpace for any great time. Therefore, it is useful to store only an index
that allows the join to be recreated cheaply.
Story
User wants to query jointly two tables, both large and both with all-sky coverage.
This means that tables have to be joined in a RDBMS and that there is no small
area of intersection that can be isolated to reduce the size of the join.
User sets up and runs a Crossmatcher task that analyzes the join and returns a plan.
This task runs as a single-step workflow, or as a direct invocation of an application
service. User passes to Crossmatcher the IVOIDs of the tables to be joined and the
parameters controlling the spatial matching of objects. Crossmatcher looks up the
details of the tables in the registry, especially the coverage, and notices that they
are large, all-sky tables. Crossmatcher then derives, and returns to User's VOSpace,
three workflows as follows.
The first workflow joins the input tables in a VOStore backed by a RDBMS and partnered
by an ADQL service operating on the same DB. The flow divides the area covered by the
inputs into tiles on the sky and has the following steps.
- Query each input catalogue for each tile, deriving matching pairs of sub-catalogues.
- Combine each pair of sub-catalogues.
- Assemble the joined catalogues into one large table.
The queries on the input catalogues are 'SELECT *' queries contrained by sky coordinates.
The joined catalogues, including the final, assembled catalogue, have columns repeating
the primary keys in the input catalogue and a column assigning a new primary key for the
join; Crossmatcher attaches special meaning to these.
The second workflow runs one query on the joined and assembled catalogue. The query
drops all the columns except the three keys noted above. I.e., it throws away all the
information except the minimum needed to recreate the joined table.
The third workflow reverses the effect of the second workflow. It recovers from the original
input catalogues the missing columns.
User has the chance to edit these workflows before running them.
User runs the first workflow immediately; it takes some time to run. User then queries the
joined table, getting quick responses.
User then finds that the full, joined table cannot be stored for very long (it is too large),
so runs the second workflow to reduce the size of the table. This removes, say, 497 of 500
columns and frees all but 1% of the storage.
User later wants to use the joined table again but realizes that it cannot be stored for
very long. User edits the third workflow made by Crossmatcher and reduces the number of
retrived columns to, say, the 50 most useful. User runs the workflow and can retain the
result for as long as needed.
Implemention notes
Crossmatcher is a CEA service, one that we don't yet have. It is not the single-table
crossmatcher that we have already deployed.
The joining of tiles can be done in parallel. If a grid is available, this is the preferred
way. If no grid is available, then it is better to do the tile joins sequentially (or in
very small batches). Crossmatcher may need to know about the available processors.
This case needs a relational VOStore. It's not clear that table joins are useful unless the
joined table can be got into a RDBMS. However, it may be possible to join all the tiles and
to store the results in files, then later to move those files into a RDBMS that is not
associated with a VOstore. Clearly, this second way is non-standard and the Crossmatcher
task can't plan to use it.
--
GuyRixon - 13 Jun 2005