Existing solutions for storing perfmon data in a SQL Server database typically use traditional b-tree type tables to house the data (for example, the PAL tool and RELOG.EXE when the destination is a SQL table). This article shows a very simple schema for storing perfmon data in a column store table.
Because perfmon data mostly consists of well-defined repeating values it is an ideal candidate for compressed storage. Depending on the sampling period, the amount of data being collected is likely to be very high, so a high-capacity storage structure that can be queried very quickly is also desirable. Lastly, because the data is not modified after it is collected, it appears as read-only data within the database. All these attributes are a perfect match for column store storage.
Here’s one possible schema that can be used for storing perfmon data in a column store table:
CREATE TABLE [dbo].[factCounterData](
    [ServerID] [int] NOT NULL,
    [DateID] [int] NOT NULL,
    [TimeID] [int] NOT NULL,
    [CounterID] [int] NOT NULL,
    [CounterValue] [float] NOT NULL
)
GO

CREATE CLUSTERED COLUMNSTORE INDEX [cci_factCounterData]
ON [dbo].[factCounterData] WITH (DROP_EXISTING = OFF)
GO
There are various ways in which you may be collecting perfmon data. This article won't discuss those options but will assume the data is present in a series of perfmon BLG files. These files can be imported into a database in a number of ways, but the simplest method is to use the RELOG command. You create a system ODBC data source name (DSN) and specify this when running the command. For example,
RELOG.EXE \\MyServerName\d$\perflogs\admin\perfdata*.blg -f SQL -o SQL:PerfmonDb!Win2012r2
For this example, the machine that runs RELOG.EXE has an ODBC system DSN named PerfmonDb that specifies the target database.
On first execution RELOG.EXE creates these tables in the target database:
CREATE TABLE [dbo].[CounterData](
    [GUID] [uniqueidentifier] NOT NULL,
    [CounterID] [int] NOT NULL,
    [RecordIndex] [int] NOT NULL,
    [CounterDateTime] [char](24) NOT NULL,
    [CounterValue] [float] NOT NULL,
    [FirstValueA] [int] NULL,
    [FirstValueB] [int] NULL,
    [SecondValueA] [int] NULL,
    [SecondValueB] [int] NULL,
    [MultiCount] [int] NULL,
    PRIMARY KEY CLUSTERED ([GUID] ASC, [CounterID] ASC, [RecordIndex] ASC)
)
GO

CREATE TABLE [dbo].[CounterDetails](
    [CounterID] [int] IDENTITY(1,1) NOT NULL,
    [MachineName] [varchar](1024) NOT NULL,
    [ObjectName] [varchar](1024) NOT NULL,
    [CounterName] [varchar](1024) NOT NULL,
    [CounterType] [int] NOT NULL,
    [DefaultScale] [int] NOT NULL,
    [InstanceName] [varchar](1024) NULL,
    [InstanceIndex] [int] NULL,
    [ParentName] [varchar](1024) NULL,
    [ParentObjectID] [int] NULL,
    [TimeBaseA] [int] NULL,
    [TimeBaseB] [int] NULL,
    PRIMARY KEY CLUSTERED ([CounterID] ASC)
)
GO

CREATE TABLE [dbo].[DisplayToID](
    [GUID] [uniqueidentifier] NOT NULL,
    [RunID] [int] NULL,
    [DisplayString] [varchar](1024) NOT NULL,
    [LogStartTime] [char](24) NULL,
    [LogStopTime] [char](24) NULL,
    [NumberOfRecords] [int] NULL,
    [MinutesToUTC] [int] NULL,
    [TimeZoneName] [char](32) NULL,
    PRIMARY KEY CLUSTERED ([GUID] ASC),
    UNIQUE NONCLUSTERED ([DisplayString] ASC)
)
GO
Subsequent executions of the RELOG.EXE command append to the data that exists in the dbo.CounterData table. Sample data from each of the tables is as follows:
Now if you forgo the column store table and simply query the three tables that RELOG.EXE creates, you have a workable short-term solution for your perfmon reporting needs. However, as the volume of data in CounterData grows you will find that your storage requirements grow and your reporting queries take longer to run.
In data warehousing terms CounterData is the fact table. This will become very large. The two other tables are the dimension tables. These will never become really large. The column store solution will focus on the CounterData table.
The columns of interest in CounterData are GUID (a link to the DisplayToID dimension table, which holds the DisplayString name – the server name), CounterID (a link to the CounterDetails dimension table, which holds the counter names), CounterDateTime and CounterValue. The column store table adopts best practice and stores the dimension links as integer columns. You can see from the structure of the factCounterData table at the start of this article that the GUID is not stored. Instead an integer ServerID is stored. Also, the CounterDateTime value is split into DateID and TimeID. Here's the code that populates the column store table:
-- Dimension table to hold the GUID to ServerID mapping
CREATE TABLE [dbo].[dimServer](
    [ServerID] [int] IDENTITY(1,1) NOT NULL,
    [ServerName] [varchar](1024) NOT NULL,
    [GUID] [uniqueidentifier] NOT NULL,
    CONSTRAINT [PK_dimServer] PRIMARY KEY CLUSTERED ([ServerID] ASC)
)
-- Routine to populate dimServer. Insert a row if it's the first time we're seeing the DisplayString
CREATE PROCEDURE [dbo].[procInsertDimServer]
AS
BEGIN
    SET NOCOUNT ON

    INSERT INTO dbo.dimServer (ServerName, GUID)
    SELECT DisplayString, GUID
    FROM dbo.DisplayToID d
    WHERE NOT EXISTS (SELECT * FROM dbo.dimServer s WHERE s.ServerName = d.DisplayString)
END
CREATE PROCEDURE [dbo].[procInsertFactCounterData]
AS
BEGIN
    SET NOCOUNT ON

    -- Insert the server row if we've not inserted it during previous executions
    EXEC dbo.procInsertDimServer

    -- Now insert the fact data. Split the datetime value into separate date and time columns
    INSERT INTO dbo.factCounterData (ServerID, DateID, TimeID, CounterID, CounterValue)
    SELECT s.ServerID,
           CAST(REPLACE(LEFT(c.CounterDateTime,10),'-','') AS int),
           DATEDIFF(second,'00:00:00',SUBSTRING(c.CounterDateTime,12,8)),
           c.CounterID,
           c.CounterValue
    FROM dbo.CounterData c
    INNER JOIN dbo.dimServer s ON c.GUID = s.GUID
END
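As a worked example of the split, a CounterDateTime of '2014-11-30 16:46:41.123' yields DateID 20141130 and TimeID 60401 (16*3600 + 46*60 + 41). You can try the expressions in isolation:

```sql
-- Demonstrates the DateID/TimeID split used in procInsertFactCounterData
DECLARE @dt char(24) = '2014-11-30 16:46:41.123'

SELECT CAST(REPLACE(LEFT(@dt,10),'-','') AS int)            AS DateID,  -- 20141130
       DATEDIFF(second,'00:00:00',SUBSTRING(@dt,12,8))      AS TimeID   -- 60401
```

Storing the two parts as plain integers keeps the fact table narrow and gives the column store long runs of repeating values to compress.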
-- CounterData is now just a staging table so can be cleared down before running RELOG.EXE
CREATE PROCEDURE [dbo].[procClearDownCounterData]
AS
BEGIN
    SET NOCOUNT ON
    TRUNCATE TABLE dbo.CounterData
END
With this code in place, the typical execution steps are:
- Execute dbo.procClearDownCounterData
- Run the RELOG.EXE command to populate CounterData, CounterDetails and DisplayToID
- Execute dbo.procInsertFactCounterData
In terms of storage costs, for one server capturing one hundred different performance counters every two seconds, I found that over 650 million rows in factCounterData occupied only two gigabytes of database space. The compression achieved depends on the number of repeating values, but you should find impressive storage reductions over conventional b-tree storage.
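To check the equivalent figures on your own system, the space used by the fact table can be read from the standard catalog views. A sketch (the figure covers all storage allocated to the table, which here is just the clustered column store index):

```sql
-- Approximate on-disk size of the columnstore fact table
SELECT SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
FROM sys.dm_db_partition_stats ps
WHERE ps.object_id = OBJECT_ID('dbo.factCounterData')
```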
In terms of reporting, here’s a parameterised stored procedure that can be used to query a server for a specified counter. Start and end time ranges can be supplied when calling the procedure:
USE [perfmondb]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[procGetCounterValues]
    @ServerName varchar(1024),
    @ObjectName varchar(1024) = NULL,
    @CounterName varchar(1024) = NULL,
    @InstanceName varchar(1024) = NULL,
    @Start datetime = NULL,
    @End datetime = NULL
AS
BEGIN
    SET NOCOUNT ON

    DECLARE @StartDateID int = CONVERT(char(8),@Start,112)
    DECLARE @StartTimeID int = DATEDIFF(second,'00:00:00',CAST(@Start AS time))
    DECLARE @EndDateID int = CONVERT(char(8),@End,112)
    DECLARE @EndTimeID int = DATEDIFF(second,'00:00:00',CAST(@End AS time))

    SELECT s.ServerName,
           DATEADD(second,f.TimeID,CAST(f.DateID AS char(8))) AS CounterTime,
           c.ObjectName,
           c.CounterName,
           c.InstanceName,
           f.CounterValue
    FROM dbo.factCounterData f
    INNER JOIN dbo.dimServer s ON s.ServerID = f.ServerID
    INNER JOIN dbo.CounterDetails c ON c.CounterID = f.CounterID
    WHERE s.ServerName = @ServerName
    AND (@ObjectName IS NULL OR c.ObjectName = @ObjectName)
    AND (@CounterName IS NULL OR c.CounterName = @CounterName)
    AND (@InstanceName IS NULL OR c.InstanceName = @InstanceName)
    AND (@StartDateID IS NULL
         OR f.DateID > @StartDateID
         OR (f.DateID = @StartDateID AND f.TimeID >= @StartTimeID))
    AND (@EndDateID IS NULL
         OR f.DateID < @EndDateID
         OR (f.DateID = @EndDateID AND f.TimeID <= @EndTimeID))
    ORDER BY CounterTime
END
GO
An example call:
EXEC dbo.procGetCounterValues
    @ServerName = 'Win2012r2',
    @CounterName = '% Processor Time',
    @InstanceName = '_Total',
    @Start = '20141130 16:46:41',
    @End = '20141130 16:46:59'
Output is this:
So ten rows out of a total of over one million rows for my data set. And the plan showed batch mode execution for the far right column store index scan:
Even querying the table when it held over 650 million rows I found my queries returned in less than three seconds.
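Aggregation queries over the fact table benefit in the same way. For example, a per-minute average for a counter can be written against the same schema (a sketch, reusing the counter names from the example call above):

```sql
-- Average counter value per minute, across all servers
SELECT f.DateID,
       f.TimeID / 60 AS MinuteOfDay,
       AVG(f.CounterValue) AS AvgCounterValue
FROM dbo.factCounterData f
INNER JOIN dbo.CounterDetails c ON c.CounterID = f.CounterID
WHERE c.CounterName = '% Processor Time'
AND c.InstanceName = '_Total'
GROUP BY f.DateID, f.TimeID / 60
ORDER BY f.DateID, MinuteOfDay
```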
So both storage space and query response are optimised when using column store storage for performance monitor data.
Column store indexes arrived in the enterprise edition of SQL Server 2012. Two of the limitations are as follows:
- Only nonclustered column store indexes can be defined (which means you still need to allocate storage space for the underlying heap or clustered B-tree data), and
- Once a nonclustered column store index has been defined, the parent table or partition can only be used for read-only purposes.
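The usual workaround for the read-only limitation is to disable the column store index, load the new data, then rebuild it (the alternative for very large tables is partition switching). A sketch, with hypothetical table and index names:

```sql
-- Disable the nonclustered columnstore index to make the table writable again
ALTER INDEX ncci_factTable ON dbo.factTable DISABLE

-- Load the new rows (illustrative staging table)
INSERT INTO dbo.factTable (custid, prodid, sale)
SELECT custid, prodid, sale FROM dbo.stagingTable

-- Rebuild the columnstore index to cover the new data
ALTER INDEX ncci_factTable ON dbo.factTable REBUILD
```

The rebuild cost on a large fact table is the price paid for this pattern, which is why the limitation matters.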
Both of these limitations are rumored to be lifted in the next version of the product.
Putting these and many other technical limitations aside, purely on the licensing front many businesses simply won't want to incur the huge cost of the Enterprise edition for a feature they expect to see as core in a modern database management system. Granted, the current product offers us the best of both worlds, allowing us to combine row-based data with column-based data, but will that be enough of a selling point? Particularly when many other vendors offer pure column-based storage with high-performance methods (similar to bulk copy/insert) to import data directly into their read/write column stores.
One of the many vendors in the column store space is Actian Corporation (formerly Ingres Corp.) (http://www.actian.com/). Their Vectorwise analytic database product is well ranked in the TPC-H Top Ten Performance Results (http://www.tpc.org/tpch/results/tpch_perf_results.asp). From a SQL Server perspective it's worth comparing this product's capabilities with column store in SQL Server 2012. To this end the free 30-day trial Windows 64-bit version was downloaded and installed on a virtual machine configured with 10GB disk space, 8GB RAM and 2 virtual CPUs. The download is available at http://www.actian.com/downloads/vectorwise#trial. One of the prerequisites for one of the graphical administration tools is the Java Runtime Environment, so this was downloaded and installed as well.
Coming from an Ingres heritage, it was pleasing to see that ANSI SQL is the standard query language for Vectorwise. The Actian Director GUI (the equivalent of SQL Server Management Studio) was used to create a test database, create tables and populate them with 46+ million rows. The table structures used were the same as the ones posted by Gavin Payne (Microsoft Certified Architect, Microsoft Certified Master at Coeo, UK) at this link:
create table tblfact (
    custid integer not null,
    prodid integer not null,
    sale money not null
) with structure=vectorwise

create table dimprod (
    prodid integer not null,
    prodname varchar(200) not null
) with structure=vectorwise

insert into dimprod (prodid, prodname)
values (1, 'Car'), (2, 'Van'), (3, 'Truck'), (4, 'Taxi'), (5, 'Horse'),
       (6, 'Helicopter'), (7, 'BizJet'), (8, 'JumboJet'), (9, 'Ferry'), (10, 'Hovercraft')

-- 100 different customers, 10 different products, sales in the range 1 to 1000000
-- This insert was run manually ten times...
insert into tblfact (custid, prodid, sale)
select (randomf()*100)+1, (randomf()*10)+1, (randomf()*1000000)+1

-- ...and then this was run multiple times to 'bulk up' the volume to 46,137,344 rows
insert into tblfact (custid, prodid, sale)
select (randomf()*100)+1, (randomf()*10)+1, (randomf()*10000)+1
from tblfact

-- Running this query returned the sum of sales for all ten products in less than 2 seconds
select dimprod.prodname, sum(tblfact.sale), count(*)
from dimprod
inner join tblfact on tblfact.prodid = dimprod.prodid
group by dimprod.prodname

-- Left joins (although not required in this case) also work. (In SQL Server the
-- columnstore index is used, although row mode rather than batch mode processing
-- appears in the execution plan.)
select dimprod.prodname, sum(tblfact.sale), count(*)
from dimprod
left join tblfact on tblfact.prodid = dimprod.prodid
group by dimprod.prodname
The on-disk sizes of the column stores were as follows:
Directory of C:\Program Files\Actian\VectorwiseVW\ingres\data\vectorwise\testingdb\CBM\default
27/01/2013 07:35 <DIR> .
27/01/2013 07:35 <DIR> ..
26/01/2013 19:04 64 db_key
27/01/2013 00:54 4,096 _serveradminSdimprod__prodid_00000036
27/01/2013 00:54 4,096 _serveradminSdimprod__prodname_00000037
27/01/2013 00:50 44,515,328 _serveradminStblfact__custid_00000033
27/01/2013 00:50 28,516,352 _serveradminStblfact__prodid_00000034
27/01/2013 00:50 373,329,920 _serveradminStblfact__sale_00000035
6 File(s) 446,369,856 bytes
So 46,137,344 rows, each of size 16 bytes (704MB of raw data), compressed down to 426MB of column store data files. From the file names you can tell which columns have the largest size. Not surprisingly the sale column is the largest, and the prodid column, with a spread of only ten products, is the smallest.
Both JDBC and ODBC can be used to query the database. A very rich and broad range of SQL constructs can be used to query the database including, for example, the windowing analytic functions.
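For instance, ranking the products by total sales with a window function works against the same test tables (a sketch in ANSI SQL):

```sql
-- Rank products by total sales using a window function over an aggregated derived table
select prodname,
       total_sale,
       rank() over (order by total_sale desc) as sale_rank
from (
    select dimprod.prodname, sum(tblfact.sale) as total_sale
    from dimprod
    inner join tblfact on tblfact.prodid = dimprod.prodid
    group by dimprod.prodname
) t
```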
To compare with SQL Server 2012 the same tables were created and populated with the same number of rows. One minor difference was that a clustered primary key on a rowid identity column was added to the tblfact table. This was done to minimise the size of the column store index (heap tables produce larger column store indexes than clustered B-trees – see 'Why is a columnstore index built from a heap larger than a columnstore index built on the same data from a clustered B-tree?' at http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq-en-us.aspx )
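The SQL Server 2012 versions of the tables were along these lines (a sketch; the rowid column, constraint and index names are illustrative of the approach just described):

```sql
CREATE TABLE dbo.tblfact (
    rowid int IDENTITY(1,1) NOT NULL,
    custid int NOT NULL,
    prodid int NOT NULL,
    sale money NOT NULL,
    CONSTRAINT PK_tblfact PRIMARY KEY CLUSTERED (rowid)
) WITH (DATA_COMPRESSION = PAGE)
GO

-- Nonclustered columnstore index over the three data columns;
-- the table becomes read-only once this exists (SQL Server 2012)
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_tblfact
ON dbo.tblfact (custid, prodid, sale)
GO
```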
The following ‘bulk up’ query was used to fill the tblfact table:
insert into tblfact (custid, prodid, sale)
SELECT TOP 2000000
    (ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%100)+1,
    (ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%10)+1,
    (ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%1000000)+1
-- the cross join below is purely a row source for TOP; any sufficiently large one will do
FROM sys.all_columns a CROSS JOIN sys.all_columns b
The compressed size (using data_compression=page) of this tblfact table was 686MB. The columnstore index on (custid, prodid, sale) was 261MB.
-- Query to fetch the column store size
select sum(css.on_disk_size) as columnstore_bytes
from sys.column_store_segments css
inner join sys.partitions p
on css.hobt_id = p.hobt_id
where p.object_id = object_id('tblfact')
-- and column_id = 4    -- uncomment to size a single column
Outside the Microsoft world there are many products available that do blazingly fast business analytics on multi-billion-row data sets stored purely as column stores. Many of these products have column store as a core feature, not a costly enterprise feature. With the current limitations of the functionality, the next release some way off, and other vendors' feature-rich products already available, it makes sense for businesses to evaluate all the options.