Archive for category Column store

Column store storage for perfmon data

Existing solutions for storing perfmon data in a SQL Server database typically use traditional b-tree type tables to house the data (for example, the PAL tool and RELOG.EXE when the destination is a SQL table). This article shows a very simple schema for storing perfmon data in a column store table.

Because perfmon data mostly consists of well-defined repeating values, it is an ideal candidate for compressed storage. Depending on the sampling period, the amount of data being collected is likely to be very high, so a high-capacity storage structure that can be queried very quickly is also desirable. Lastly, because the data is not modified after it is collected, it appears as read-only data within the database. All these attributes are a perfect match for column store storage.

Here’s one possible schema that can be used for storing perfmon data in a column store table:

CREATE TABLE [dbo].[factCounterData](
[ServerID] [int] NOT NULL,
[DateID] [int] NOT NULL,
[TimeID] [int] NOT NULL,
[CounterID] [int] NOT NULL,
[CounterValue] [float] NOT NULL
)
GO

CREATE CLUSTERED COLUMNSTORE INDEX [cci_factCounterData] ON [dbo].[factCounterData] WITH (DROP_EXISTING = OFF)
GO

There are various ways in which you may be collecting perfmon data. This article won’t discuss these options but will assume the data is present in a series of perfmon BLG files. These files can be imported into a database in a number of ways but the simplest method is to use the RELOG command. You create a system ODBC data source name (DSN) and specify this when running the command. For example,

RELOG.EXE \\MyServerName\d$\perflogs\admin\perfdata*.blg -f SQL -o SQL:PerfmonDb!Win2012r2

For this example, the machine that runs RELOG.EXE has an ODBC system DSN named PerfmonDb that specifies the target database.

On first execution RELOG.EXE creates these tables in the target database:

CREATE TABLE [dbo].[CounterData](
[GUID] [uniqueidentifier] NOT NULL,
[CounterID] [int] NOT NULL,
[RecordIndex] [int] NOT NULL,
[CounterDateTime] [char](24) NOT NULL,
[CounterValue] [float] NOT NULL,
[FirstValueA] [int] NULL,
[FirstValueB] [int] NULL,
[SecondValueA] [int] NULL,
[SecondValueB] [int] NULL,
[MultiCount] [int] NULL,
PRIMARY KEY CLUSTERED
(
[GUID] ASC,
[CounterID] ASC,
[RecordIndex] ASC
)
)
GO

CREATE TABLE [dbo].[CounterDetails](
[CounterID] [int] IDENTITY(1,1) NOT NULL,
[MachineName] [varchar](1024) NOT NULL,
[ObjectName] [varchar](1024) NOT NULL,
[CounterName] [varchar](1024) NOT NULL,
[CounterType] [int] NOT NULL,
[DefaultScale] [int] NOT NULL,
[InstanceName] [varchar](1024) NULL,
[InstanceIndex] [int] NULL,
[ParentName] [varchar](1024) NULL,
[ParentObjectID] [int] NULL,
[TimeBaseA] [int] NULL,
[TimeBaseB] [int] NULL,
PRIMARY KEY CLUSTERED
(
[CounterID] ASC
)
)
GO

CREATE TABLE [dbo].[DisplayToID](
[GUID] [uniqueidentifier] NOT NULL,
[RunID] [int] NULL,
[DisplayString] [varchar](1024) NOT NULL,
[LogStartTime] [char](24) NULL,
[LogStopTime] [char](24) NULL,
[NumberOfRecords] [int] NULL,
[MinutesToUTC] [int] NULL,
[TimeZoneName] [char](32) NULL,
PRIMARY KEY CLUSTERED
(
[GUID] ASC
),
UNIQUE NONCLUSTERED
(
[DisplayString] ASC
)
)
GO

Subsequent executions of the RELOG.EXE command append to the data that exists in the dbo.CounterData table. Sample data from each of the tables is as follows:

DisplayToID:

[Screenshot: sample rows from the DisplayToID table]

CounterDetails:

[Screenshot: sample rows from the CounterDetails table]

CounterData:

[Screenshot: sample rows from the CounterData table]
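
If you want to inspect the staged data yourself, a simple join such as this (an illustrative query of my own, not something produced by RELOG.EXE) resolves the GUID and CounterID values to readable names:

-- Show a few staged samples with server and counter names resolved
SELECT TOP 10
                d.DisplayString AS ServerName,
                cd.CounterDateTime,
                c.ObjectName,
                c.CounterName,
                c.InstanceName,
                cd.CounterValue
FROM dbo.CounterData cd
INNER JOIN dbo.CounterDetails c
ON c.CounterID = cd.CounterID
INNER JOIN dbo.DisplayToID d
ON d.GUID = cd.GUID
ORDER BY cd.CounterDateTime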

Now if you forgo the column store table and simply query the three tables that RELOG.EXE creates, you have a workable short-term solution for your perfmon reporting needs. However, as the volume of data in CounterData grows, you will find that your storage requirements grow and your reporting queries take longer to run.

In data warehousing terms, CounterData is the fact table and will become very large. The other two tables are the dimension tables and will never become really large. The column store solution will focus on the CounterData table.

The columns of interest in CounterData are GUID (a link to the DisplayToID dimension table, which holds the DisplayString name – the server name), CounterID (a link to the CounterDetails dimension table, which holds the counter names), CounterDateTime and CounterValue. The column store table adopts best practice and stores the dimension links as integer columns. You can see from the structure of the factCounterData table at the start of this article that the GUID is not stored; instead an integer ServerID is stored. Also, the CounterDateTime value is split into DateID and TimeID. Here’s the code that populates the column store table:


-- Dimension table to hold the GUID to ServerID mapping
CREATE TABLE [dbo].[dimServer](
[ServerID] [int] IDENTITY(1,1) NOT NULL,
[ServerName] [varchar](1024) NOT NULL,
[GUID] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_dimServer] PRIMARY KEY CLUSTERED
(
[ServerID] ASC
)
)
GO

-- Routine to populate dimServer. Insert row if it's the first time we're seeing the DisplayString
CREATE PROCEDURE [dbo].[procInsertDimServer] AS
BEGIN
SET NOCOUNT ON

INSERT INTO dbo.dimServer (ServerName,GUID)
SELECT DisplayString,GUID
FROM dbo.DisplayToID d
WHERE NOT EXISTS (SELECT * FROM dbo.dimServer s WHERE s.ServerName = d.DisplayString)
END
GO

CREATE PROCEDURE [dbo].[procInsertFactCounterData]
AS
BEGIN
SET NOCOUNT ON

-- Insert the server row if we've not inserted it during previous executions
EXEC dbo.procInsertDimServer

-- Now insert the fact data. Split the datetime value into separate date and time columns
INSERT INTO dbo.factCounterData (
                ServerID,
                DateID,
                TimeID,
                CounterID,
                CounterValue
)
SELECT
                s.ServerID,
                CAST(REPLACE(LEFT(c.CounterDateTime,10),'-','') AS int),
                DATEDIFF(second,'00:00:00',SUBSTRING(c.CounterDateTime,12,8)),
                c.CounterID,
                c.CounterValue
FROM dbo.CounterData c
INNER JOIN dbo.dimServer s
ON c.GUID = s.GUID
END
GO

-- CounterData is now just a staging table so can be cleared down before running RELOG.EXE
CREATE PROCEDURE [dbo].[procClearDownCounterData]
AS
BEGIN
SET NOCOUNT ON
TRUNCATE TABLE dbo.CounterData
END
GO

With this code in place, the typical execution steps are:

  1. Execute dbo.procClearDownCounterData
  2. Run the RELOG.EXE command to populate CounterData, CounterDetails and DisplayToID
  3. Execute dbo.procInsertFactCounterData
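
If you prefer to drive all three steps from T-SQL, the two procedures and the RELOG.EXE call can be wrapped in a single routine. This is a sketch only: it assumes xp_cmdshell is enabled on the instance, and the BLG path and DSN are the illustrative ones used earlier.

CREATE PROCEDURE [dbo].[procLoadPerfmonData]
AS
BEGIN
SET NOCOUNT ON

-- Step 1: empty the staging table
EXEC dbo.procClearDownCounterData

-- Step 2: import the BLG files (assumes xp_cmdshell is enabled; path and DSN are illustrative)
EXEC master.sys.xp_cmdshell 'RELOG.EXE \\MyServerName\d$\perflogs\admin\perfdata*.blg -f SQL -o SQL:PerfmonDb!Win2012r2'

-- Step 3: move the staged rows into the column store fact table
EXEC dbo.procInsertFactCounterData
END
GO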

In terms of storage costs, for one server capturing one hundred different performance counters every two seconds, I found that over 650 million rows in factCounterData occupied only two gigabytes of database space. The compression achieved depends on the number of repeating values, but you should find impressive storage reductions over conventional b-tree storage.
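
To check the equivalent figures on your own system, sp_spaceused reports the row count and reserved space for the fact table:

-- Row count and reserved/used space for the column store fact table
EXEC sp_spaceused 'dbo.factCounterData'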

In terms of reporting, here’s a parameterised stored procedure that can be used to query a server for a specified counter. Start and end time ranges can be supplied when calling the procedure:


USE [perfmondb]
GO

/****** Object:  StoredProcedure [dbo].[procGetCounterValues]    Script Date: 13/01/2015 23:23:16 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[procGetCounterValues]
@ServerName varchar(1024),
@ObjectName varchar(1024) = NULL,
@CounterName varchar(1024) = NULL,
@InstanceName varchar(1024) = NULL,
@Start datetime = NULL,
@End datetime = NULL
AS
BEGIN

SET NOCOUNT ON

DECLARE @StartDateID int = CONVERT(char(8),@Start,112)
DECLARE @StartTimeID int = DATEDIFF(second,'00:00:00',CAST(@Start AS time))
DECLARE @EndDateID   int = CONVERT(char(8),@End,112)
DECLARE @EndTimeID   int = DATEDIFF(second,'00:00:00',CAST(@End AS time))

SELECT
s.ServerName,
DATEADD(second,f.TimeID,CAST(f.DateID AS char(8))) AS CounterTime,
c.ObjectName,
c.CounterName,
c.InstanceName,
f.CounterValue
FROM dbo.factCounterData f
INNER JOIN dbo.dimServer s
ON s.ServerID = f.ServerID
INNER JOIN dbo.CounterDetails c
ON c.CounterID = f.CounterID
WHERE s.ServerName = @ServerName
AND (@ObjectName IS NULL OR c.ObjectName = @ObjectName)
AND (@CounterName IS NULL OR c.CounterName = @CounterName)
AND (@InstanceName IS NULL OR c.InstanceName = @InstanceName)
AND (
@StartDateID IS NULL
OR f.DateID > @StartDateID
OR (f.DateID = @StartDateID AND f.TimeID >= @StartTimeID)
)
AND (
@EndDateID IS NULL
OR f.DateID < @EndDateID
OR (f.DateID = @EndDateID AND f.TimeID <= @EndTimeID)
)
ORDER BY CounterTime
END

GO

An example call:

EXECUTE [dbo].[procGetCounterValues]
@ServerName = ‘Win2012r2’,
@CounterName = ‘% Processor Time’,
@InstanceName = ‘_Total’,
@Start = ‘20141130 16:46:41’,
@End = ‘20141130 16:46:59’

Output is this:

[Screenshot: query output showing ten rows of % Processor Time samples]

So ten rows out of a total of over one million rows for my data set. And the plan showed batch mode execution for the far right column store index scan:

[Screenshot: execution plan showing batch mode for the column store index scan]


Even when the table held over 650 million rows I found my queries returned in less than three seconds.
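
Timings like these are easy to reproduce with SET STATISTICS TIME (shown here purely as an illustration):

SET STATISTICS TIME ON
EXECUTE dbo.procGetCounterValues @ServerName = 'Win2012r2'
SET STATISTICS TIME OFF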

So both storage space and query response times are optimised when using column store storage for performance monitor data.



Column store solutions

Column store indexes arrived in the enterprise edition of SQL Server 2012. Two of the limitations are as follows:

  1. Only nonclustered column store indexes can be defined (which means you still need to allocate storage space for the underlying heap or clustered b-tree data).
  2. Once a nonclustered column store index has been defined, the parent table or partition can only be used for read-only purposes.

Both of these limitations are rumored to be lifted in the next version of the product.
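
In the meantime, the usual workaround for the read-only restriction is to disable the nonclustered column store index before a load and rebuild it afterwards (or to switch data in via partitions). A minimal sketch, assuming a table tblfact with a nonclustered column store index named ncci_tblfact:

-- Disabling the column store index makes the table writable again
ALTER INDEX ncci_tblfact ON dbo.tblfact DISABLE

-- ...load the new data here...

-- Rebuilding the index restores column store query performance
ALTER INDEX ncci_tblfact ON dbo.tblfact REBUILD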

Putting these and many other technical limitations aside, purely on the licensing front many businesses simply won’t want to incur the huge cost of the enterprise edition for a feature they expect to see as core for a modern database management system. Granted, the current product offers us the best of both worlds, allowing us to combine row based data with column based data, but will that be enough of a selling point? Particularly when many other vendors are offering pure column based storage with high performance methods (similar to bulk copy/insert) to import data directly into their read/write column stores.

One of the many vendors in the column store space is Actian Corporation (formerly Ingres Corp.) (http://www.actian.com/). Their Vectorwise analytic database product is well ranked in the TPC-H Top Ten Performance Results (http://www.tpc.org/tpch/results/tpch_perf_results.asp). From a SQL Server perspective it’s worth comparing this product’s capabilities with column store in SQL Server 2012. To this end the free 30-day trial Windows 64-bit version was downloaded and installed on a virtual machine configured with 10GB disk space, 8GB RAM and 2 virtual CPUs. The download is available at http://www.actian.com/downloads/vectorwise#trial. One of the prerequisites for one of the graphical administration tools is the Java Runtime Environment, so this was downloaded and installed as well.

Coming from an Ingres heritage, it was pleasing to see that ANSI SQL is used as the standard query language for Vectorwise. The Actian Director GUI (the equivalent of SQL Server Management Studio) was used to create a test database, create the tables and populate the fact table with 46+ million rows. The table structures used were the same as the ones posted by Gavin Payne (Microsoft Certified Architect, Microsoft Certified Master at Coeo, UK) at this link:

http://gavinpayneuk.com/2012/07/22/this-is-columnstore-part-1/

create table tblfact (custid integer not null, prodid integer not null, sale money not null) with structure=vectorwise

create table dimprod (prodid integer not null,prodname varchar(200) not null) with structure=vectorwise

insert into dimprod (prodid, prodname) 
values 
(1, 'Car'),(2, 'Van'),(3, 'Truck'), 
(4, 'Taxi'), (5, 'Horse'), (6, 'Helicopter'), 
(7, 'BizJet'),(8, 'JumboJet'),(9, 'Ferry'), 
(10, 'Hovercraft')

-- 100 different customers, 10 different products, sales in the range 1 to 1000000 
-- This insert was run manually ten times...
insert into tblfact (custid, prodid, sale) 
select (randomf()*100)+1, (randomf()*10)+1,(randomf()*1000000)+1

-- ...and then this was run multiple times to 'bulk up' the volume to 46,137,344 rows
insert into tblfact (custid, prodid, sale) 
select (randomf()*100)+1, (randomf()*10)+1,(randomf()*10000)+1
from tblfact

-- Running this query returned the sum of sales for all ten products in less than 2 seconds
select dimprod.prodname, sum(tblfact.sale),count(*) 
from dimprod
inner join tblfact
on tblfact.prodid = dimprod.prodid 
group by dimprod.prodname

-- Left joins (although not required in this case) also work (In SQL Server the columnstore index is used although row mode as opposed to batch mode processing is used in the execution plan)
select dimprod.prodname,sum(tblfact.sale),count(*)
from dimprod
left join tblfact
on tblfact.prodid = dimprod.prodid 
group by dimprod.prodname

[Screenshot: Vectorwise querying in Actian Director]

The on-disk sizes of the column stores were as follows:

Directory of C:\Program Files\Actian\VectorwiseVW\ingres\data\vectorwise\testingdb\CBM\default

27/01/2013  07:35    <DIR>          .
27/01/2013  07:35    <DIR>          ..
26/01/2013  19:04                64 db_key
27/01/2013  00:54             4,096 _serveradminSdimprod__prodid_00000036
27/01/2013  00:54             4,096 _serveradminSdimprod__prodname_00000037
27/01/2013  00:50        44,515,328 _serveradminStblfact__custid_00000033
27/01/2013  00:50        28,516,352 _serveradminStblfact__prodid_00000034
27/01/2013  00:50       373,329,920 _serveradminStblfact__sale_00000035
6 File(s)    446,369,856 bytes

So 46,137,344 rows, each of size 16 bytes, = 704MB of raw data compressed down to 426MB of column store data files. From the file names you can tell which columns occupy the most space. Not surprisingly the sale column is the largest and the prodid column, with a spread of only ten products, is the smallest.

Both JDBC and ODBC can be used to query the database, and a very rich and broad range of SQL constructs is supported including, for example, the windowing analytic functions.
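
As a quick illustration of that breadth, an ANSI-style windowed aggregate such as the following (my own example, not taken from the Actian documentation) runs against the test table:

-- Running total of sales per customer using a windowing function
select custid, prodid, sale,
sum(sale) over (partition by custid order by prodid) as running_sale
from tblfact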

To compare with SQL Server 2012 the same tables were created and populated with the same number of rows. One minor difference was that a clustered primary key on a rowid identity column was added to the tblfact table. This was done to minimise the size of the column store index (heap tables produce larger column store indexes than clustered b-trees – see ‘Why is a columnstore index built from a heap larger than a columnstore index built on the same data from a clustered B-tree?’ at http://social.technet.microsoft.com/wiki/contents/articles/3540.sql-server-columnstore-index-faq-en-us.aspx).
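
The SQL Server version of the fact table therefore looked something like this (a reconstruction, since the original DDL wasn’t posted; note the column store index must be created after the data load, because it makes the table read-only):

create table tblfact (
rowid int identity(1,1) not null primary key clustered,
custid int not null,
prodid int not null,
sale money not null)

-- Created after the data load (the index makes the table read-only)
create nonclustered columnstore index ncci_tblfact
on tblfact (custid, prodid, sale)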

The following ‘bulk up’ query was used to fill the tblfact table:

insert into tblfact (custid, prodid, sale)
SELECT TOP 2000000
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%100)+1,
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%10)+1,
(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT))%1000000)+1
from tblfact

The compressed size (using data_compression=page) of this tblfact table was 686MB. The columnstore index on (custid, prodid, sale) was 261MB.

-- Queries to fetch column store size

select lob_reserved_page_count/128.0
from sys.dm_db_partition_stats
where index_id<>1
and object_id = object_id('tblfact')

select sum(css.on_disk_size)/(1024*1024.0)
from sys.column_store_segments css
inner join sys.partitions p
on css.hobt_id = p.hobt_id
where p.object_id = object_id('tblfact')
-- and column_id = 4

[Screenshot: SQL Server columnstore size query results]

Outside the Microsoft world there are many products available that do blazingly fast business analytics with multi-billion row data stored purely as column stores. Many of these products have column store as a core feature and not a costly enterprise feature. With the current limitations of the functionality, the next release some way off, and feature-rich products from other vendors already available, it makes sense for businesses to evaluate all options.
