Notes on Sybase IQ
Jim Lewis of
IoA, Cambridge, arranged for a one-month evaluation licence for Sybase IQ, a product designed for data mining and analytical processing. Its principal selling points are column-oriented storage combined with clever indexing options. Together these are said to provide huge performance improvements compared with the usual row-oriented storage of an ordinary RDBMS, especially for queries which only reference one or a few columns.
A meeting was held at
IoA on 9th April with technical and sales reps from Sybase to explain the product and carry out some tests, after Jim had been using the product for about two weeks. He had managed to ingest the whole one billion rows of the of USNO-B catalog. These are my notes from the meeting: perhaps those present will correct any mistakes that I've made.
Product and Platform
Sybase IQ is a completely separate product from the Sybase ASE relational DBMS which has been used for some years at a number of astronomical sites. Some effort appears to have been made to make the user interfaces similar, but otherwise they are unrelated.
It is available on Solaris and several other platforms, but not Linux. The Sybase reps thought that a port to Linux was quite likely eventually, but there were no current plans.
There is no support for distributed databases, probably because the networking overheads would negate the advantages of the architecture. Sybase IQ can be run on a multi-processor server, but any one query only uses a single processor, so this is useful only when there are several concurrent queries.
There is a product named IQ Accelerator, which will work with third-party data warehouses (Oracle, DB2, Teradata).
Architecture and Indexing
Sybase IQ is most efficient when given a raw disc, but can work within a container file in a normal file system, with space pre-allocated for it. It is possible to add more storage space to a system which exceeds the capacity given to it initially.
Data in each column are stored separately, and in some compressed format which allows for fast querying.
Several types of index are supported, some optimised for columns of low cardinality (i.e. where the number of distinct values is below 1500). This applies to some columns in astronomical tables (e.g. filter settings, spetral type) and possibly others such as magnitude if one can tolerate rounding to say the nearest 0.1 magnitude. Index types include:
- Default (described as "Fast Project" index)
- Low cardinality fast index
- High cardinality non-group index - for aggregation and range searches
- High cardinality group index - for key fields and grouping for cross-tabulations
- Word/phrase index - for text searches
- Compare index - result of comparison between two defined columns
- Join index - to precompute results of join between two predefined tables
Index creation is said to be relatively fast.
There was no multi-dimensional (or spatial) index.
Querying and Interfaces
It supports SQL92 with some extensions to cover their extra index types. There are drivers for ODBC and JDBC. There were no facilities to generate XML (or HTML) directly, but this can be done with suitable code via the JDBC interface, of course. The interface is close to Sybase ASE, but sufficiently different to be confusing occasionally ('sp_help' does not work as expected, for example).
There seem to be no facilities to load data directly from binary files (or from tables stored in other RDBMS) the normal (and "best") way of loading data was via text (CSV) files. The data loader seemed to run faster than those of typical RDBMS.
It is possible to write user-defined functions in Java (but not other languages, apparently). Stored procedures written in SQL are also possible; they may be executed more efficiently.
Demonstration/Evaluation
The product was installed on a Sun Blade 1000 system running Solaris-8 without problems (but a Sybase expert assisted Jim Lewis with this process). Ingestion of the whole USNO-B catalog was relatively easy but required the data to be converted from its original binary to text (CSV) format. Jim's Perl scripts to do this had not converted all the null values correctly, but this did not matter much for the current testing.
The data loading rate was 13k rows/sec (perhaps limited by non-local disc, Jim reported up to 24k rows/sec when using a local disc). Generation of 80GB IQ filesystem takes ~2 hours. Ingestion of the whole of USNO-B took under a day.
Index creation was very fast: creation of indices for the APM catalogue (300 million rows) took about 2.5 days on Sybase ASE, but under an hour for Sybase IQ. It is recommended that the table structure, including its indices, be defined initially, before data are loaded, but further indices can be created subsequently.
Query performance seemed very impressive: selections on, say, magnitude with a small range took only a minute or two, and similarly a selection involving a small area of sky
such as:
SELECT * FROM USNOB WHERE RA BETWEEN 180 AND 181 AND DECLINATION BETWEEN 10 AND 10.1;
Was executed in just a few seconds, despite the table being able to use an index on only one spatial coordinate. We did not get accurate timing for these, as there did not seem to be any easy way to do this apart from using a stopwatch.
Performance on square searches seems to be on a par with, or slightly slower than, the "scat" search facility of WCSTools by Doug Mink.
Licences
Sybase IQ is normally licenced per cpu, with no limit on the number of clients or users. The standard price is £30,000 per cpu (plus VAT). Various discounts may be available, especially if we purchase licences for many cpus, or for the
AstroGrid project as a whole (likely to cost a million or two). Software support is presumably additional to this, but we did not discuss it.
Comments and Conclusions
Data loading from text files seems quite fast, though it is a pity that there is no direct way of ingesting binary data. Index creation seems very fast. Querying on a few columns is also very impressive, perhaps an order of magnitude faster than using row-oriented DBMS. Of course we managed to get a speed-up of over an order of magnitude during our DBMS evaluations simply by using a rather simple FITS-table based data format with column-orientation, using software knocked together in a couple of hours, but that is rather different from having a complete mature product which supports this feature.
The lack of a spatial index is a serious flaw for astronomical use in my opinion, and rather surprising, since most leading relational DBMS now include some spatial indexing method, even if only as an extra-cost option. The small box query which took a few seconds could probably be reduced to a few milliseonds with a suitable R-tree index. With a single cone or box search such inefficiency perhaps does not matter much, but a JOIN between tables based on celestial positions is likely to be awfully slow without a genuine spatial index. The Sybase reps suggested that, if we invested seriously in the product, the company might be willing to write additional code to support spatial queries.
Astronomers are likely to need user-defined functions (such as great-circle disance, HTM or HEALPix index calculation) within their SQL. In other DBMS the recommended way of
writing them is to use a fully-compiled language such as C; if Sybase IQ only allows Java, the overheads of the JVM may reduce the preformance of queries using such functions.
Licencing costs are very high by our standards, but the same is true for many software products which lack a wide user-base. The lack of a version for Linux also puts up likely costs of implementation, since suitable Sun servers are rather more expensive.
Alternatives
There are many other companies with OLAP products, including the big players in the database world such as Oracle and IBM, and many of them provide a range of alterntive indexing methods, especially bitmap indexing. I have not come across any of the major products claiming to use column-oriented table storage. After a bit of googling and searches on the comp.databases.olap I came up with two products which do: Sand from
Sand, and
Kdd but neither of these seems to be quite comparable to Sybase IQ.
Further information
There is some information on the site of
Sybase and also in the newsgroup sybase.public.iq
Many Sybase manuals are on-line in
manuals.sybase.com.
--
ClivePage - 10 Apr 2003