Column store indexes arrived in the enterprise edition of SQL Server 2012. Two of the limitations are as follows:
Only nonclustered column store indexes can be defined (which means you still need to allocate storage space for the underlying heap or clustered B tree data) and
Once a nonclustered column store index has been defined the parent table or partition can only be used for read-only purposes.
Both of these limitations are rumored to be lifted in the next version of the product.
Putting these and many other technical limitations aside, purely on the licensing front many businesses simply won’t want to incur the huge cost of the enterprise edition for a feature they expect to see as core for a modern database management system. Granted the current product offers us the best of both worlds, allowing us to combine row based data with column based data but will that be enough of a selling point ? Particularly when many other vendors are offering pure column based storage with high peformance methods (similar to bulk copy/insert) to import data directly into their read/write column stores.
One of the many vendors in the column store space is Actian Corporation (formerly Ingres Corp.) (http://www.actian.com/). Their Vectorwise analytic database product is well ranked in the TPC-H Top Ten Performance Results (http://www.tpc.org/tpch/results/tpch_perf_results.asp). From a SQL Server perspective it’s worth comparing this product’s capabilities with column store in SQL Server 2012. To this end the free 30 day trial Windows 64 bit version was downloaded and installed on a virtual machine configured with 10GB disk space, 8GB RAM and 2 virtual CPUs. The download is available at http://www.actian.com/downloads/vectorwise#trial. One of the prerequisites for one of the graphical administration tools is the Java Run Time environment so this was downloaded and installed as well.
Coming from an Ingres heritage it was pleasing to see that ANSI SQL is used as the standard query language for Vectorwise. The Actian Director GUI (equivalent of MS SQL Management Studio) was used to create a test database, create a table and populate it with 46+ million rows. The table structures used were the same as the ones posted by Gavin Payne (Microsoft Certified Architect, Microsoft Certified Master at Coeo, UK) at this link:
create table tblfact (custid integer not null, prodid integer not null, sale money not null) with structure=vectorwise create table dimprod (prodid integer not null,prodname varchar(200) not null) with structure=vectorwise insert into dimprod (prodid, prodname) values (1, 'Car'),(2, 'Van'),(3, 'Truck'), (4, 'Taxi'), (5, 'Horse'), (6, 'Helicopter'), (7, 'BizJet'),(8, 'JumboJet'),(9, 'Ferry'), (10, 'Hovercraft') -- 100 different customers, 10 different productions, sales in the range 1 to 1000000 -- This insert was run manually ten times... insert into tblfact (custid, prodid, sale) select (randomf()*100)+1, (randomf()*10)+1,(randomf()*1000000)+1 -- ...and then this was run multiple times to 'bulk up' the volume to 46,137,344 rows insert into tblfact (custid, prodid, sale) select (randomf()*100)+1, (randomf()*10)+1,(randomf()*10000)+1 from tblfact -- Running this query returned the sum of sales for all ten products in less than 2 seconds select dimprod.prodname, sum(tblfact.sale),count(*) from dimprod inner join tblfact on tblfact.prodid = dimprod.prodid group by dimprod.prodname -- Left joins (although not required in this case) also work (In SQL Server the columnstore index is used although row mode as opposed to batch mode processing is used in the execution plan) select dimprod.prodname,sum(tblfact.sale),count(*) from dimprod left join tblfact on tblfact.prodid = dimprod.prodid group by dimprod.prodname
The on-disk size of the column stores were as follows:
Directory of C:\Program Files\Actian\VectorwiseVW\ingres\data\vectorwise\testingdb\CBM\default
27/01/2013 07:35 <DIR> .
27/01/2013 07:35 <DIR> ..
26/01/2013 19:04 64 db_key
27/01/2013 00:54 4,096 _serveradminSdimprod__prodid_00000036
27/01/2013 00:54 4,096 _serveradminSdimprod__prodname_00000037
27/01/2013 00:50 44,515,328 _serveradminStblfact__custid_00000033
27/01/2013 00:50 28,516,352 _serveradminStblfact__prodid_00000034
27/01/2013 00:50 373,329,920 _serveradminStblfact__sale_00000035
6 File(s) 446,369,856 bytes
So 46,137,344 rows each of size 16 bytes = 704MB raw data compressed down to 426MB of column store data files. From the file names you can tell which columns have the largest size. Not surprisingly the sale column is the largest and the product column with only a spread of ten products is the smallest.
Both JDBC and ODBC can be used to query the database. A very rich and broad range of SQL constructs can be used to query the database including, for example, the windowing analytic functions.
To compare with SQL Server 2012 the same tables were created and populated with same number of rows. One minor difference was that a clustered primary key on a rowid identity column was added to the tblfact table. This was done to minimise the size of the column store indexes (heap table produce larger column store indexes than clustered B trees – see ‘Why is a columnstore index built from a heap larger than a columnstore index built on the same data from a clustered B-tree?’ at http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq-en-us.aspx )
The following ‘bulk up’ query was used to fill the tblfact table:
insert into tblfact (custid, prodid, sale)
SELECT TOP 2000000
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%100)+1,
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%10)+1,
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%1000000)+1
The compressed size (using data_compression=page) of this tblfact table was 686MB. The columnstore index on (custid, prodid, sale) was 261MB.
— Queries to fetch column store size
and object_id = object_id(‘tblfact’)
from sys.column_store_segments css
inner join sys.partitions p
on css.hobt_id = p.hobt_id
where p.object_id = object_id(‘tblfact’)
— and column_id = 4
Outside the Microsoft world there are many products available that do blazingly fast business analytics with multi-billion row data stored purely as column stores. Many of these products have column store as a core feature and not a costly enterprise feature. With the current limitations of the functionality, with the next release some way off and with other vendors with feature rich products already available it makes sense for businesses to evaluate all options.