Search for Brown Dwarf Stars in the Hyades
The search for brown dwarf stars is one of the Astrogrid "Top 10" science cases, so I was interested to find recently that one of our research students was undertaking a search for brown dwarfs in the Hyades star cluster. This is just the sort of research work for which the VO was invented. More than that, this programme is especially modest in its demands on Virtual Observatory (VO) infrastructure: it needs no image-processing, no time-series analysis, no spectral analysis, and depends on no specific external software packages. It mainly needs acess to two well-known source catalogues: 2MASS and USNO-B. Most of the required operations are within the grasp of a database management system, and it just happens that we have both of them loaded into Postgres on two Astrogrid machines: hydra and cass123.
These notes on the scientific requirements of the project are derived from a lunch-time seminar that the student gave, and from subsequent conversations and emails with her. She started out doing the work using just the facilities of the web: this involved some hundreds of separate cone-seaches (because existing web-sites limit the amount of data you can retrieve from one search) and writing software to process the text files locally. She is now repeating the work using the databases loaded on hydra.
I have set out the work required into a number of discrete steps, and for each noted firstly the scientific requirements, secondly how this can be done using SQL on a local DBMS, and thirdly how this could be done using the proposed VO facilities, for example using ADQL/text (alias SkyQL).
(1) Select stars in the Hyades region of sky from USNO-B
Science: Hyades is, I think, the nearest star cluster nearest to the Sun. It is centred at (4h 25m 53s, 15d 52m 12s) and a circle of radius 13 degrees is needed to encompass all its stars. The first selection, therefore, is just a simple cone-search. The reason for doing this as a separate stage is that this simple selection removes all but about 1% of stars in the sky, and that two separate trails lead on, a data subset, and a control subset.
SQL: When using a large catalogue it is essential to use an index, indeed a spatial index. The syntax then becomes DBMS-specific. The index available in our installation of Postgres is an R-tree, which selects objects in any rectangle on the sky, given two corners. A subsequent filtering finds sources not just within the bounding rectangle but actually within a circle of 13 degrees radius centred on the given position: this needs some mildly messy trigonometry, but I have already set up a user-defined function called
gcdist to handle this: its arguments are assumed to be in degrees, as are the columns in the table. DBMS are notorious for not handling physical units; the astronomer has to take care of such vital details. The SQL required is then just:
SELECT * FROM usnob INTO h1
WHERE errbox && box(point(66.72-13,15.87-13),point(66.72+13,15.87+13))
AND gcdist(66.72,15.87,ra,decl) < 13.0 ;
Note that the
INTO clause is also non-standard SQL, but without it any SQL statement just returns you text; since this is just the first stage in a multi-stage selection process, this is clearly not what is required.
VO: The equivalent selection in ADQL/text is simpler, using the
REGION function; the data access layer has to translate this into whatever syntax is appropriate for the DBMS and index that is in use. Unfortunately ADQL has, so far, no way of specifying the destination of the query. I shall ignore this in these examples, and return to the topic later.
SELECT * FROM usnob WHERE REGION('Circle J2000 66.72 15.87 13.0');
Comments:
- Notation I have set out these examples using upper-case for SQL or ADQL keywords, and lower-case elsewhere; this is just to make the examples clearer. SQL is generally case-insensitive, and as far as I know, ADQL/text is also.
- Dataset size: Another reason for selecting INTO another table on the same DBMS is that this first selection produces over 8 million rows. Since the table has some 89 columns, if the results were exported as a VOTable, the data volume would be around 6 GB (for TABLEDATA) or 1 GB (for BINARY or FITS forms of VOTable). If we support select INTO we may, as here, be able to avoid exporting such large files, but it is clearly desirable for our data access layers to be able to export in efficient forms of VOTable, and not just in the simplest form of TABLEDATA.
- Registry: It is also worth noting that the Registry has not been needed so far; in this, as in our other Top 10 science cases, the astronomer knows what resources to use and where they are located. [Main.NicholasWalton - 12 May 2004: I'm not sure this is a true statement. in this case the astronomer is chosing to just search the usno-b and cross-ref with 2mass. in the more general case, there may be other relevant observations of the hyades, and certainly if you wanted to go deeper you'd be looking at sdss/ukidss for instance - you want a registry query to find out if these survey's had observed this patch of sky]. That is a good point. Another consideration is that the user might want to use the Registry to find the location of a site which happens to have USNO and 2MASS loaded in the same DBMS (ClivePage - 13 May 2004]
- SELECT columns: For simplicity I have used
SELECT * in these examples, which selects all columns. This is perhaps what the user would normally do, as it is easy, and defers a decision on which to select. In practice a discriminating astronomer, or one short of disc space, might just select the essential columns from the large number available.
(2) Select stars with proper motions in the appropriate range
Science: The proper motions (actual velocity vectors) of stars in the Hyades cluster are known to fall into a specific range, so the next step is to select from the 8 million candidate objects only those with known proper motions in the required range. This reduces the number of objects to only around 50,000, which is much more manageable.
SQL: The selection from the previous subset is very simple to do:
SELECT * FROM h1 INTO h2
WHERE pmra BETWEEN 50 AND 100 AND pmdec BETWEEN 0 AND -100;
VO: At present ADQL cannot produce subsets, so the selection would have to be done from the base table, using syntax like this:
SELECT * FROM usnob WHERE REGION('Circle J2000 66.72 15.87 13.0')
AND pmra BETWEEN 50 AND 100 AND pmdec BETWEEN 0 AND -100;
Comments:
- Indexing: The SQL statement selects from the subset without the assistance of an index. A modern DBMS is fast enough that doing this from a table of 8 million rows is feasible; but if several selections had to be made, it would be desirable to create an index, on say pmra or pmdec. This is easy using SQL, but not supported in ADQL. As well as
CREATE INDEX it would be necessary to support DROP INDEX .
- Physical Units: This is a problem that has been nagging us for some time, but which, as far as I know, is still unsolved. Basically how does the user know when formulating this query that the units of
pmra and pmdec in this table are milli-arcseconds/year and not one of many other equally justifiable choices? At present astronomers do this using their expert knowledge, which means trawling through documents or knowing the right person to ask. The VO needs a better solution. DBMS are notorious for ignoring the metadata problem but we cannot. In principle the Registry will be able to store metadata including units for each column of each table in the VO. This still leaves open the question of how the user specifies the query: I think there are two options here.
- The simplest solution would be to have a standard for each quantity which every one uses in queries, and the VO then converts the value to that used by the column in question. One problem is that the agreed ISO and IAU Standards for units are somewhat lacking in user friendliness. For example: the ISO standard unit for angles is radians, and time is seconds, so the standard unit for proper motions would be radians/second. This would be somewhat unfamiliar to most astronomers.
- An alternative would be to allow units strings to be given in the query, for example
WHERE pmra BETWEEN 50 mas/yr AND 100 mas/yr but this would need a significant extension to our parsing abilities, and agreement on standard strings for units. It would be a lot safer, however, and I think I favour this option.
(3) Select a control set of stars with different proper motions
Science: This control subset is chosen from the same patch of sky, but with proper motions in the opposite direction. The reason for the control subset is to get an idea of how many field stars are likely to have been selected in the "hyades" set by accident. The syntax is pretty much as before:
SQL: the query is very much like the previous one:
SELECT * FROM h1 INTO c2 WHERE pmra BETWEEN -50 AND -150 AND pmdec BETWEEN 0 AND 100;
VO: Again this selection can be done from the base table; the main penalty is that the initial selection of say 8 million objects from one billion takes some time, and it seems a pity to have to do it twice. If a spatial index exists, any reasonable optimiser will use it to do the crude selection, and with the selection on pmra and pmdec ranges left until later.
SELECT * FROM usnob WHERE REGION('Circle J2000 66.72 15.87 13.0')
AND pmra BETWEEN -50 AND -150 AND pmdec BETWEEN 0 AND 100;
Comment:
Workflow: Given the need for two queries identical except for two parameters, it would be nice for the workflow component of the VO to be able to handle things like this without requiring the user to retype the whole query.
(4) Check the results
Science: At this stage the astronomer wants to be sure that the optimum selection parameters have been chosen, ideally by plotting some parameters of the resutls.
SQL: Unfortunately SQL databases have no facilities to do graphics or plotting, but one can use SQL to find simple statistics, for example like this:
SELECT count(*), min(pmra), avg(pmra), max(pmra), min(pmdec), avg(pmdec) max(pmdec)
FROM hyades2;
Some DBMS (but not all) go as far as a function to find standard-deviation or variance. Obviously one would want to do similar checks on the control subset, which I'll omit in the interests of brevity.
VO: At present ADQL could be used to compute the basic statistics (I think) and to produce a crude histogram, but there is no equivalent of
select into it is hard to see how to interface it to an external graphics package.
Comments:
- Histograms: One gets a better idea of the quantities in one's sample if one can produce a histogram of the distribution of values; this is just about possible using SQL, but awkward. There are some limitations, for example if any bin has zero counts, that row in the output table is just omitted. Here's a simple example:
SELECT FLOOR(pmra/10), COUNT(FLOOR(pmra/10)) FROM h2 GROUP BY FLOOR(pmra/10) ORDER BY FLOOR(pmra/10);
floor | count
-------+-------
5 | 5965
6 | 5647
7 | 5115
8 | 4621
9 | 4422
10 | 4203
11 | 4055
12 | 3934
13 | 3867
14 | 3689
(10 rows)
- User-defined functions: I have coded up a few user-defined functions (called stored procedures in some other DBMS) to present basic statistics, histograms etc. These are handly for the local user of SQL, it is not clear whether they could be used in ADQL, as no extension mechanism exists at present, nor any way of enquiring which extra functions are available in any given database installation.
- Graphics: What is really wanted, of course, is a graphical display of various plots, so that distributions can be seen on screen. Tools such as VOPLOT and TOPCAT can handle VOTable output. At present we do not have any way of extacting data from a DBMS table direct to such a tool, although TOPCAT (and the FITS viewer
fv )can take data given a suitable URL, so not much work should be needed to get this operational.
(5) Compute the direction of Proper Motion
Science: At this point the equations get a bit more complicated and a diagram would be useful, but that's not easy in this wiki. Essentially you want to compute the observed and true proper motion vector, and then select only those stars which appear to be going towards the common convergence point (CP). A few more quantities need to be computed for each star, given the convergence point at racp=94.2, deccp=7.6 degrees. The necessary equations are these (though I may well have got the details slightly wrong):
D = acos(sin(decl) * sin(7.6) + cos(decl) * cos(7.6) * cos(ra - 94.2))
T = acos( (sin(7.6) - sin(decl) * cos(D)) / (cos(decl) * sin(D)))
theta = atan2(pmra, pmdec)
dtheta = 1.5 * sqrt(raerr^2 + decerr^2) / sqrt(pmra^2 + pmdec^2)
Note that in practice the function
radians is needed at several points here in the SQL, as most quantities are in degrees.
SQL: This needs statements to create the additional columns, then populate them with UPDATE statements, e.g.
ALTER TABLE h2 ADD COLUMN (d float, t float, theta float, dtheta float);
UPDATE h2 SET d = acos(sin(radians(decl)) * sin(radians(7.6)) + cos(radians(decl)) *
cos(radians(7.6)) * cos(radians(94.2-ra))));
UPDATE h2 SET t = acos((sin(radians(7.6)) - sin(radians(decl)) * cos(d)) /
(cos(radians(decl)) * sin(d))));
UPDATE h2 SET theta = atan2(pmra,pmdec);
UPDATE h2 SET dtheta = 1.5 * sqrt(ra_err*ra_err + dec_err*dec_err)/
sqrt(pmra*pmra + pmdec*pmdec);
The syntax is a bit verbose because of the absence of an operator for squaring: there are functions to do this, usually POW or POWER, but they are not standard. The most standard way of solving this is just to multiply a quantity by itself.
The same calculations are, of course, needed on the control table.
VO: It is obviously highly desirable to allow the user to UPDATE these working tables (but not to update the standard catalogues), and to use ALTER TABLE as well. These facilities are at present not part of the ADQL specification.
Comment:
- Units: As noted earlier, the Registry is supposed to be capable of keeping track of the unit sf the columns in the tables which are standard atalogues, so that the user will be able to specify queries using explicit or standardised units, and have them converted automatically. Ideally the Registry should also be able to keep track of columns in these working tables which the user creates, and new columns created in them. I don't think we have any plans to do this, but it seems just as necessary.
(6) Select objects with the right proper motion vectors
Science: This is another simple selection.
SQL: no surprises here:
SELECT * FROM h2 INTO h3
WHERE t BETWEEEN theta - dtheta AND theta + dtheta;
VO: It is worth noting that ADQL
could be used to do the whole selection so far in one operation. The syntax is something like this:
SELECT * FROM usnob WHERE REGION('Circle J2000 66.72 15.87 13.0')
AND pmra BETWEEN -50 AND -150 AND pmdec BETWEEN 0 AND 100
AND acos( (sin(radians(7.6)) - sin(radians(decl)) *
cos(sin(radians(decl)) * sin(radians(7.6)) + cos(radians(decl)) *
cos(radians(7.6)) * cos(radians(94.2-ra))))) /
(cos(radians(decl)) * sin(acos(sin(radians(decl)) * sin(radians(7.6)) + cos(radians(decl)) *
cos(radians(7.6)) * cos(radians(94.2-ra)))) )) BETWEEN atan2(pmra,pmdec) -
1.5 * sqrt(ra_err*ra_err + dec_err*dec_err)/
sqrt(pmra*pmra + pmdec*pmdec) AND atan2(pmra,pmdec) +
1.5 * sqrt(ra_err*ra_err + dec_err*dec_err)/
sqrt(pmra*pmra + pmdec*pmdec);
I haven't had time to check this, and may have got mis-matched parentheses.
Comment:
- Query Construction: This may be a slightly artificial example, but some scientific queries are going to be pretty complicated. The question is: do our user interfaces (e.g. The Portal) give the user enough assistance to make it easy to build query without making mistakes? The attractions of doing one small query at a time, and refining the results, as traditionally done by astronomers, rather than doing everythign in one big bite, as done by SQL-enthusiasts, begin to look attractive to me.
(7) Compute stellar distance and select on distance range
Science: The next step is to estimate the actual distance of each star, from its proper motion and the angle D.
SQL: This involves computing a further column in the table, and then selecting from it, producing further subset h3.
ALTER TABLE h3 ADD COLUMN (dist float);
UPDATE h3 SET dist = 46.7 sin(d) / (4.74 * sqrt(pmra*pmra + pmdec*pmdec)) ;
SELECT * from h3 INTO h4 WHERE dist BETWEEN 32.0 AND 60.0 ;
VO: The syntax for the selection is like SQL. In this case the selection is fairly simple, and it would be possible to do the selection and projection in one valid ADQL statement, without forcing a new column to be added. For example:
SELECT *, 46.7 sin(d) / (4.74 * sqrt(pmra*pmra + pmdec*pmdec)) AS dist
FROM h2 WHERE 46.7 sin(d) / (4.74 * sqrt(pmra*pmra + pmdec*pmdec))
BETWEEN 32.0 AND 60.0;
Of course without SELECT INTO, this selection needs to be incorporated in the WHERE condition in the last ADQL query - an exercise for the reader.
(8) Cross-match the resulting subset with 2MASS
Science: The initial selection was from USNO-B which has optical magnitudes and proper motions: by cross-matching with 2MASS we get infra-red magnitudes of each object as well.
SQL: The cross-match becomes a spatial join, which has a syntax heavily DBMS-dependent. Here we are using Postgres with R-tree indices. The basic syntax is something like this:
SELECT * FROM h4,twomass INTO h5 WHERE h4.errbox && twomass.errbox AND
gcdist(h4.ra,h4.decl,twomass.ra,twomass.decl) < 0.003 ;
Actually it cannot be quite a simple as this, as there will be clashes of column names in the two tables; we know they both have columns named
ra and
decl as these were chosen to be identical for convenience. The select expression will therefore start off something like this:
SELECT h4.ra AS hra, h4.dec AS dec, twomass.ra AS tra, twomass.dec AS tdec,
h4.b1mag AS b1mag, twomass.kmag AS kmag,
and many lines like these. It will often be desirable to include a distance, which can be computed easily if a great-circle distance function has been provided, e.g.
gcdist(h4.ra,h4.dec,twomass.ra,twomass.dec) AS offset,
VO: The current ADQL specification includes a cross-match function, but it is not clear to me exactly how it operates. In this case where both USNO-B and 2MASS have been loaded into Postgres on the same computer, and have had spatial indices created for them, the ADQL and the equivalent SQL look something like this:
SELECT * FROM h5 AS h, twomass AS t INTO h6 WHERE XMATCH(h,t) < 3;
Comment:
- The XMATCH function needs some further thought, it seems to me, as it is desirable to give the astronomer control over matching radius in some way.
(9) Produce more graphical plots
Science: Here the astronomer wants to produce a number of plots, e.g. colour-colour diagrams. The problems of doing this in raw SQL or in ADQL are the same as before.
(10) Find absolute magnitude
Science: Since the distance and apparent K magnitude are now known, one can find the absolute K magnitude of each star. This involves adding another column and computing values using UPDATE.
SQL:
ALTER TABLE h6 ADD COLUMN (abskmag real);
UPDATE h6 SET abskmag = kmag + 5 - 5 * log10(dist);
(11) Select section of colour-colour diagram
Science: In the next operation a graphical display is pretty much essential: the astronomer wants to display a plot of the last subset with
abskmag plotted against
(kmag - b1mag) or something similar. Finally it is necessary to select sources on that diagram lying below some line. The exact positioning of the line depends on some astronomical expertise. If it simply a straight line then its coordinates can be converted into a simple linear equation involving the magnitudes involved, and then an SQL (or ADQL) select statement be used. It would obviously be nice if the graphical utility allowed such a line to be positioned interactively, and the necessary selection parameters read off. I don't think that VOPlot or TOPCAT support this at present.
The astronomer's selection could easily be more complex than a simple line on a diagram, it could include a region of more complex shape. Again it would be nice to have a tool which allowed a polygon to be drawn on the diagram showing which sources to select, and then have the system turn this into a selection expression which could then be used in SQL/ADQL.
--
ClivePage - 10 Feb 2004