r3 - 12 Aug 2002 - 15:23:26 - BobMannYou are here: TWiki >  Astrogrid Web  > SQLServerEvaluation

WP-A4: SQLServer Evaluation

DBMS Version:
Microsoft SQL Server 2000 Enterprise Edition
N.B. Choice of MS SQL Server versions discussed at http://www.microsoft.com/sql/techinfo/planning/SQLResKChooseEd.asp.

Test platform:
Windows 2000 Advanced Server, running on Toshiba Tecra 9000 laptop (Pentium III, 1.2GHz, 512 MB).

Availability:
both SQL Server 2000 (in several versions) and Windows 2000 (ditto) are available as part of the Microsoft Academic Alliance scheme, for which the total cost is currently $800 per annum per department.

Installation:
both Windows 2000 Advanced Server and SQL Server 2000 Enterprise Edition installed straight from the CD, with no problems.

Setting up test database:
Most database operations conducted using Enterprise Manager GUI. It has an Action button, from which a drop-down menu appears, including a New database item, which sets up a new database, with options to change default settings for location, initial size, etc.

Schema generation for test database:
The easiest way to generate a schema seems to be to run the appropriate SQL commands in the Query Analyser GUI. Setting up the gsc and usno tables in the astrogrid database took the following commands:


use astrogrid

create table gsc(
   pcode       int not null, -- HEALPix index
   gscid      int not null, -- GSC map number (4 digits) // GSC_ID (5 digits)
   gscra      float not null, -- RA (radians)
   gscdec       float not null, -- Dec (radians)
   pos_err      real not null, -- posn error 
                (assumed 1 sigma, PCODE generated using 3 sigma)
   mag      real not null, -- magnitude
   mag_err    real not null, -- magnitude error
   mag_band    int not null, 
   class        int not null,
   plate_id   varchar(4) not null, -- 4 chars
   multiple   varchar(1) not null -- should be logical
   )

create table usno(
   uid      int not null, -- arbitrary sequential number
   ura      float not null, -- RA (radians)
   udec      float not null, -- Dec (radians)
   rmag      real null, -- there are some nulls here
   bmag      real null, -- ditto
   upcode      int not null -- HEALPix code
   )

Loading the data:The easiest way to load ASCII data into SQL Server is to use the DTS Import/Export Wizard GUI, which is launched from the Enterprise Manager GUI: the Data Transformation Services (DTS) packages provides "a set of graphical tools and programmable objects that lets you extract, transform, and consolidate data from disparate sources into single or multiple destinations".

The DTS Import/Export Wizard leads the user through a set of steps specifying the input data format and the database table location, and offers the user tha ability to save the parameters thus set into a package that that can be run again later. CGP's USNO input catalogue contained a number of "\N" entries denoting null values. SQL Server didn't like those, so I had to replace them by a default value before I could get the USNO dataset to load properly. My one previous experience of loading into SQL Server data that contains nulls suggests that it is far easier to do that using a CSV (comma-separated variable) format input datafile -- SQL Server happily reads in a row containing entries "...1,2,,3..." and understands that there's a null value being set between the 2 and the 3.

There are other ways to import data from files. From the Query Analyser GUI, one can run a BULK INSERT command and there is also the bcp command prompt utility which does the same thing: I think that these two options would allow the import of data in binary format (not possible through the DTS Import/Export Wizard), but I'm not 100% sure of that.

Indexing the tables:
Creating a B-tree index on pcode values is readily done by running the following SQL commands in the Query Analyser window:

create index gsc_pcode_index on gsc(pcode) 
create index usno_pcode_index on usno(upcode)

Working out how much extra space is taken up by the index is a slightly non-trivial, due to the way that SQL Server reserves disk space for itself. The user can specify how this can be done, but the default is to leave it up to SQL Server to grow the database as and when required. It seems that, when using this option, disk space reserved for a particular database is not immediately freed once the data filling it is deleted. For example, I was playing around, loading different-sized subsets of the USNO catalogue in SQL Server, to see how the load and index time scaled, and found that the size of the database file didn't vary with the number of rows. It took me a little while to work out that SQL Server had expanded the disk space reserved for the database as I was loading the full USNO catalogue, but hadn't released it immediately when I deleted that table and tried loading smaller subsets.

Simple join on PCODE values:
CGP's first test query

select * from gsc, usno where pcode=upcode
takes about 30 seconds to run (finding 358455 matches), but running a second query, which also writes the results into a new table
select * into pcodetable from gsc, usno where pcode=upcode
straight afterwards only takes 14 seconds, presumably because much of the data is still cached.

Great-circle join:
CGP's next test query was "a proper great-circle join with the additional weeding needed". It took me a little time to translate his suggested SQL command into a command that SQL Server could run, namely:

select distinct *, round(degrees(7200*asin(sqrt(square(sin(0.5*(u.udec-g.gscdec))) + 
cos(u.udec)*cos(g.gscdec)*square(sin(0.5*(u.ura-g.gscra)))))),2) as sep
into  myjoin 
from gsc as g, usno as u
where g.pcode=u.upcode and 
2.0*asin(sqrt(square(sin(0.5*(u.udec-g.gscdec))) +
 cos(u.udec)*cos(g.gscdec)* square(sin(0.5*(u.ura-g.gscra))))) 
 < 5e-6+pos_err
but that was mainly due to a combination of lack of familiarity with SQL and typos. That ran in 25 seconds, writing 241460 rows into the myjoin table.

Statistical queries:
Given that I'd had to put default values into my USNO table, in place of the nulls that had been present in the original dataset, my statistical queries required an extra where statement to filter out those defaults, i.e.

select avg(bmag) from usno where bmag < 99.0
for the average of the USNO B mags, and
select stdev(bmag) from usno where bmag < 99.0
for their standard deviation. Together, these two queries took 15s to run.

Defining user functions:
At CGP's suggestion, I wrote a user function to compute the great-circle distance between two points. Taking a user function of Jim Gray's as a template, it took me literally 5 mins to write the following


IF EXISTS (SELECT name 
      FROM   sysobjects 
      WHERE  name = N'fGreatCircleDist' )
DROP FUNCTION fGreatCircleDist


CREATE FUNCTION fGreatCircleDist (@ra1 float, @dec1 float, @ra2 float, @dec2 float)
-------------------------------------------------------------
--/H Returns float holding great circle distance [in arcsec] 
--/H between two points (ra1,dec1) & (ra2,dec2) [ras,decs in radians] 
-------------------------------------------------------------

RETURNS float
AS 
begin
return 2.0*asin(sqrt(square(sin(0.5*(@dec1-@dec2))) + 
 cos(@dec1)*cos(@dec2)* square(sin(0.5*(@ra1-@ra2)))))
END
which tests if there already exists a function called fGreatCircleDist, and, if not, defines it in the appropriate manner. This I tested with the following
declare @ra1 float
declare @dec1 float
declare @ra2 float
declare @dec2 float
set @ra1=1.0
set @dec1=1.0
set @ra2=2.0
set @dec2=2.0
print dbo.fGreatCircleDist(@ra1,@dec1,@ra2,@dec2)

and checked the results against those from the gcirc procedure from the IDL astrolib library.

Conclusions:

All in all, I have been very impressed by the ease-of-use of SQL Server, which is probably about all that can be assessed by such a brief evaluation procedure. Many of the basic database operations are performed using GUIs, which I liked, and there is a voluminous collection of "Books online" documenting the system and all its features.

I don't think that this evaluation revealed anything much that we didn't already know or suspect beforehand. The fact that SQL Server has been used so successfully to produce the SDSS SkyServer system shows that there are no obvious "show-stoppers" as regards its use for astronomical catalogues. The HTM(Hierarchical Triangular Mesh) sky pixelisation scheme developed at Johns Hopkins for the SDSS has been implemented into SQL Server and I found it trivial to install that into the test SuperCOSMOS database that Jim Gray, Nigel Hambly and I currently producing with SQL Server: all I had to do was to copy a DLL file into the appropriate folder, load a few spatial functions (in the manner of the great-circle distance function above) and that was it.

The SuperCOSMOS Sky Survey database we're currently creating will provide a more serious evaluation of SQL Server: it'll be ~1TB in size, has a more complex schema than these simple USNO and GSC catalogues and will be exposed to external users. It is our intention to produce a federation of the SDSS-EDR and SSS (i.e. do the work of the axed optical/near-infrared pilot, WP-A5.1) using SQLServer in the next few months. WFAU are also developing a database for the 6df Galaxy Redshift Survey using SQLServer, and it remains the default choice for the WFCAM Science Archive.

One clear advantage of SQL Server, as far as AstroGrid is concerned, is its level of integration into the .Net framework, which should make it very easy for us to develop and deploy web services that interface to SQL Server databases. Of course, this also touches on one of SQL Server's main disadvantages, which is its total reliance on other Microsoft proprietary technologies, tying us into the use of Windows, etc.

-- BobMann - 12 Aug 2002

Edit | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r3 < r2 < r1 | More topic actions
 
AstroGrid Service Click here for the
AstroGrid Service Web
This is the AstroGrid
Development Wiki
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback