Number of copies based retention for database backups

Ola Hallengren’s DatabaseBackup stored procedure implements retention of database backups based on the age of the backups in hours. For example, if you specify the @CleanupTime parameter as 24, then when the next backup runs any existing backup older than 24 hours is deleted. If you want to retain two backups you would specify 48.

For certain cases a more natural way of specifying retention is by number of copies. The attached version of Ola’s DatabaseBackup stored procedure implements this via the @CleanupCopies and @MirrorCleanupCopies parameters. The modified version of Ola’s stored procedure calls dbo.CleanupCopies to do the actual deletions:


CREATE PROCEDURE [dbo].[CleanupCopies]
@DirectoryPath nvarchar(4000),
@CleanupCopies int,
@NumberOfFiles int,
@NumberOfDirectories int
AS
BEGIN

SET NOCOUNT ON

DECLARE @RetainFiles int = @CleanupCopies * @NumberOfFiles / @NumberOfDirectories,
@CurrentFileID int,
@CurrentFileName nvarchar(4000),
@ReturnCode int = 0

DECLARE @CurrentFiles TABLE (ID int IDENTITY (1,1) PRIMARY KEY NOT NULL,
FileName nvarchar(4000) NULL,
Depth int NULL,
IsFile int NULL,
CleanupCompleted bit NULL,
CleanupOutput int NULL)

INSERT @CurrentFiles (FileName,Depth,IsFile)
EXEC [master].dbo.xp_dirtree @DirectoryPath,0,1

SET @ReturnCode = @@ERROR

IF @ReturnCode <> 0 RETURN @ReturnCode

-- Remove the newest @RetainFiles files from the work table;
-- whatever remains in the table will be deleted from disk below
DELETE FROM @CurrentFiles
WHERE ID IN (
SELECT TOP(@RetainFiles) ID
FROM @CurrentFiles
ORDER BY FileName DESC
)

UPDATE @CurrentFiles
SET CleanupCompleted = 0,
CleanupOutput = 0

-- Do the deletions
WHILE EXISTS (SELECT * FROM @CurrentFiles WHERE CleanupCompleted = 0)
BEGIN
SELECT TOP 1 @CurrentFileID = ID,
@CurrentFileName = @DirectoryPath + N'\' + FileName
FROM @CurrentFiles
WHERE CleanupCompleted = 0
ORDER BY ID ASC

EXECUTE @ReturnCode = [master].dbo.xp_delete_file 0, @CurrentFileName

UPDATE @CurrentFiles
SET CleanupCompleted = 1,
CleanupOutput = @ReturnCode
WHERE ID = @CurrentFileID

END

SELECT @ReturnCode = ISNULL(MAX(CleanupOutput),0) FROM @CurrentFiles

RETURN @ReturnCode

END
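The selection logic is easier to see outside T-SQL. Here is a minimal Python sketch of what the procedure does with the xp_dirtree listing (the function name and sample file names are mine; it assumes backup file names embed a sortable timestamp, as Ola’s naming convention does):

```python
# Illustrative sketch only, not the T-SQL itself: given a directory listing,
# keep the newest N backup files and return the rest for deletion.
# Backup file names sort chronologically because they embed a timestamp
# (e.g. WIN2016_TestDB_FULL_20161112_213244.bak), so an ordinary
# descending sort puts the newest files first.

def files_to_delete(filenames, cleanup_copies, number_of_files=1,
                    number_of_directories=1):
    """Mirror of the @RetainFiles calculation: when a backup is striped
    across several files/directories, each directory holds
    copies * files / directories files per retained backup set."""
    retain = cleanup_copies * number_of_files // number_of_directories
    newest_first = sorted(filenames, reverse=True)
    # Everything beyond the retained set is a deletion candidate
    return sorted(newest_first[retain:])

files = [
    "DB_FULL_20161110_010000.bak",
    "DB_FULL_20161111_010000.bak",
    "DB_FULL_20161112_010000.bak",
]
print(files_to_delete(files, cleanup_copies=2))
# ['DB_FULL_20161110_010000.bak']  (the oldest file only)
```

This is exactly why the caveat below about file naming matters: the scheme only works because the timestamped names sort in date order.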

Caveats:

Retention based on number of copies is not available for transaction log backups; age-based retention remains the best fit for these.

It only caters for native backups (i.e. it is not implemented for LiteSpeed, Red Gate or Idera backups). This is because the solution uses xp_delete_file, which only handles native backup files.

 


Table driven procedure parameter values

This posting shows one approach for the lookup of stored procedure parameter values. The example use case will be the very popular SQL Server Maintenance Solution authored by Ola Hallengren (https://ola.hallengren.com/).

The solution consists of three parameterised stored procedures, one for database backups, one for database integrity checks and one for index optimisations. These procedures can be run as SQL Agent TSQL job steps. The stored procedure calls along with their parameter values can be encoded in these job steps. But maintenance can become a bit of a problem.

Suppose you have many databases on your SQL Server. You have multiple backup job steps where all but one of the backup stored procedure parameters are identical: they differ only in which directory they write their backups to. Now suppose you are low on storage and a new volume has been assigned. You need to identify the job steps that refer to the nearly full volumes and change the directory name parameter to point at the newly presented volume. This can be done manually through the GUI, or you could hack the msdb.dbo.sysjobsteps table and update the command column. As another example, suppose you have multiple job steps implementing different index optimisation parameters. Again you would have to resort to the sysjobsteps table.

A far better approach is to store the parameter values in a lookup table. You would have three lookup tables, one for each of the maintenance operations: backup, integrity check and index optimisation. You list all the variations of the parameters as separate rows in these three tables, with each row having a descriptive tag name. This tag name column would be the primary key for the table, so a given tag name refers to all the parameter values for one execution of that operation.

Taking the dbo.DatabaseBackup stored procedure as an example these are the parameters (as of the time of writing):


SELECT *
FROM INFORMATION_SCHEMA.PARAMETERS
WHERE SPECIFIC_CATALOG = db_name()
AND SPECIFIC_SCHEMA = 'dbo'
AND SPECIFIC_NAME = 'DatabaseBackup'
ORDER BY ORDINAL_POSITION

 

[Screenshot params1: the DatabaseBackup parameter list returned by the query above]

The screenshot doesn’t show the complete list of parameters; there are 36 in total for this procedure. With the exception of @Databases (which will be dealt with separately later) we want to pivot so that the PARAMETER_NAME values (with the leading @ symbol removed) become our column names. In addition, the first column will be named ParamTag and be the primary key. The second column will hold a fixed value, the stored procedure name. Here’s a procedure that accomplishes this task:


/* First create a common 'parent' table for all three procedures. Storing the ParamTag value is sufficient for the data model */
/* This common table will be required when permitted operations for groups of databases is discussed */
  
CREATE TABLE dbo.OperationParams
(
OperationType sysname NOT NULL CONSTRAINT CK_OperationParams CHECK (OperationType IN ('DatabaseBackup','DatabaseIntegrityCheck','IndexOptimize')),
ParamTag nvarchar(255) NOT NULL,
CONSTRAINT PK_OperationParams PRIMARY KEY CLUSTERED (OperationType,ParamTag)
)
GO

CREATE PROCEDURE [dbo].[CreateOperationParamTable]
    @procname sysname
AS
BEGIN

DECLARE @sqlstmt nvarchar(max)

-- Build the column list first
SET @sqlstmt = (
    SELECT ',[' + STUFF(PARAMETER_NAME,1,1,'') + '] ' + DATA_TYPE +
    CASE 
    WHEN CHARACTER_MAXIMUM_LENGTH IS NULL THEN ''
    WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)'
    ELSE '(' + STR(CHARACTER_MAXIMUM_LENGTH) + ')'
    END
    + ' NULL'
    FROM INFORMATION_SCHEMA.PARAMETERS
    WHERE SPECIFIC_CATALOG = db_name()
    AND SPECIFIC_SCHEMA = 'dbo'
    AND SPECIFIC_NAME = @procname
    AND PARAMETER_NAME <> '@Databases'
    ORDER BY ORDINAL_POSITION
    FOR XML PATH('')
)

-- Add the create table and constraints
DECLARE @quotedtablename sysname = QUOTENAME(@procname + 'Params')
SET @sqlstmt = N'IF OBJECT_ID(''dbo.' + @quotedtablename + ''') IS NOT NULL DROP TABLE dbo.' + @quotedtablename + ';' + CHAR(13) + CHAR(10) +
    N'CREATE TABLE dbo.' + @quotedtablename + ' (' + CHAR(13) + CHAR(10) + 
    N'[ParamTag] nvarchar(255) NOT NULL CONSTRAINT ' + STUFF(@quotedtablename,2,0,'PK_') + N' PRIMARY KEY CLUSTERED,' + CHAR(13) + CHAR(10) + 
    N'[OperationType] sysname NOT NULL CONSTRAINT ' + STUFF(@quotedtablename,2,0,'CK_') + N' CHECK ([OperationType] IN (' + QUOTENAME(@procname,CHAR(39)+CHAR(39)) + N')) CONSTRAINT ' + STUFF(@quotedtablename,2,0,'DF_') + N' DEFAULT (' + QUOTENAME(@procname,CHAR(39)+CHAR(39)) + N')' + CHAR(13) + CHAR(10) + 
    @sqlstmt + CHAR(13) + CHAR(10) + 
    N',CONSTRAINT ' + STUFF(@quotedtablename,2,0,'FK_') + N' FOREIGN KEY ([OperationType],[ParamTag]) REFERENCES dbo.OperationParams ([OperationType],[ParamTag]))'

-- Run it
PRINT 'Running:' + CHAR(13) + CHAR(10) + @sqlstmt
EXEC sp_executesql @sqlstmt

END
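To make the dynamic string building easier to follow, here is a rough Python sketch of the same column-definition logic (the helper function and the sample metadata rows are hypothetical; the real procedure reads them from INFORMATION_SCHEMA.PARAMETERS):

```python
# Hypothetical sketch of the DDL generation: pivot rows of
# (parameter name, data type, character max length) into column definitions.
# A max length of None means a non-character type, -1 means (max),
# mirroring CHARACTER_MAXIMUM_LENGTH in INFORMATION_SCHEMA.PARAMETERS.

def column_defs(params):
    defs = []
    for name, data_type, max_len in params:
        if max_len is None:
            length = ""
        elif max_len == -1:
            length = "(max)"
        else:
            length = f"({max_len})"
        # Strip the leading @ so the parameter name becomes a column name
        defs.append(f"[{name.lstrip('@')}] {data_type}{length} NULL")
    return defs

params = [
    ("@Directory", "nvarchar", -1),
    ("@BackupType", "nvarchar", -1),
    ("@CleanupTime", "int", None),
]
print(",\n".join(column_defs(params)))
# [Directory] nvarchar(max) NULL,
# [BackupType] nvarchar(max) NULL,
# [CleanupTime] int NULL
```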

The procedure is then called as follows to create the Params tables for each of Ola’s three stored procedures:

EXEC [dbo].[CreateOperationParamTable] 'DatabaseBackup'
EXEC [dbo].[CreateOperationParamTable] 'DatabaseIntegrityCheck'
EXEC [dbo].[CreateOperationParamTable] 'IndexOptimize'

/* Insert some param values */
INSERT INTO dbo.OperationParams VALUES ('DatabaseBackup','FullBackups')
INSERT INTO dbo.DatabaseBackupParams (ParamTag,Directory,BackupType,[Compress],[LogToTable]) values ('FullBackups','c:\dbadmin\backup','FULL','Y','Y')
INSERT INTO dbo.OperationParams VALUES ('DatabaseIntegrityCheck','CheckDBs')
INSERT INTO dbo.DatabaseIntegrityCheckParams (ParamTag,CheckCommands,[LogToTable]) values ('CheckDBs','CHECKDB','Y')
INSERT INTO dbo.OperationParams VALUES ('IndexOptimize','ReindexReorgs')
INSERT INTO dbo.IndexOptimizeParams (ParamTag,FragmentationLow,[LogToTable]) values ('ReindexReorgs','INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE','Y')

SELECT * FROM dbo.OperationParams
SELECT * FROM dbo.DatabaseBackupParams
SELECT * FROM dbo.DatabaseIntegrityCheckParams
SELECT * FROM dbo.IndexOptimizeParams

This results in creation of the DatabaseBackupParams table, the DatabaseIntegrityCheckParams table and the IndexOptimizeParams table:

[Screenshot params2: the three generated Params tables]

With the Databases parameter removed we now have parameter-tagged rows that can be applied to one or more groups of databases. Ola’s solution allows the Databases parameter to accept keywords such as USER_DATABASES and SYSTEM_DATABASES, or a comma-separated list of database names. Having omitted the Databases column from the above three tables, we now need tables that let us define groups of databases and the permitted operations for each group:


CREATE TABLE [dbo].[DatabaseGroupParent] (
    [GroupName] [nvarchar](40) NOT NULL,
 CONSTRAINT [PK_DatabaseGroupParent] PRIMARY KEY CLUSTERED ([GroupName] ASC)
) 
CREATE TABLE [dbo].[DatabaseGroup] (
    [GroupName] [nvarchar](40) NOT NULL CONSTRAINT [FK_DatabaseGroup] FOREIGN KEY REFERENCES [dbo].[DatabaseGroupParent] ([GroupName]),
    [DatabaseName] [sysname] NOT NULL,  
    CONSTRAINT [PK_DatabaseGroup] PRIMARY KEY CLUSTERED ([GroupName] ASC,[DatabaseName] ASC) 
) 

CREATE TABLE [dbo].[DatabaseGroupOperations] (     
    [GroupName] [nvarchar](40) NOT NULL CONSTRAINT [FK_DatabaseGroupOperations_1] FOREIGN KEY REFERENCES [dbo].[DatabaseGroupParent] ([GroupName]),
    [OperationType] [sysname] NOT NULL,     
    [ParamTag] [nvarchar](255) NOT NULL, 
    CONSTRAINT [PK_DatabaseGroupOperations] PRIMARY KEY CLUSTERED ([GroupName] ASC,[OperationType] ASC,[ParamTag] ASC),
    CONSTRAINT [FK_DatabaseGroupOperations_2] FOREIGN KEY ([OperationType], [ParamTag]) REFERENCES [dbo].[OperationParams] ([OperationType], [ParamTag])
) 
GO 

One issue with Ola’s solution is the default values for the IndexOptimize @FragmentationMedium and @FragmentationHigh parameters. If you don’t specify these parameters the values used are ‘INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’ and ‘INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE’ respectively. This contrasts with @FragmentationLow, whose default is NULL. For any of these three parameters an explicit NULL signifies that reindex/reorganise is not performed. In this table-driven solution a NULL column value is translated to DEFAULT when the stored procedures are called, which presents a problem for @FragmentationMedium and @FragmentationHigh (for @FragmentationLow it doesn’t matter, as NULL is the default anyway). The workaround for @FragmentationMedium and @FragmentationHigh is to use a user-defined keyword, for example NO_INDEX_MAINT. Rather than hardcoding this string in the procedures, the solution stores it in the following table:

CREATE TABLE dbo.OperationParamsNullKeyword
(
OperationType sysname NOT NULL,
ParamName sysname NOT NULL,
NullKeyword nvarchar(max) NOT NULL,
CONSTRAINT PK_OperationParamsNullKeyword PRIMARY KEY CLUSTERED (OperationType,ParamName)
)
GO

INSERT INTO dbo.OperationParamsNullKeyword
VALUES
('IndexOptimize','FragmentationMedium','NO_INDEX_MAINT'),
('IndexOptimize','FragmentationHigh','NO_INDEX_MAINT')
GO

In effect, the processing stored procedure will translate these keywords to ‘NULL’ and not ‘DEFAULT’. Ola has example ‘C. Update statistics on all user databases’ on his Index and Statistics page. An IndexOptimizeParams row containing

FragmentationLow    FragmentationMedium    FragmentationHigh    UpdateStatistics
NULL                NO_INDEX_MAINT         NO_INDEX_MAINT       ALL

will be translated to

EXECUTE dbo.IndexOptimize
@Databases = 'USER_DATABASES',
@FragmentationLow = NULL,
@FragmentationMedium = NULL,
@FragmentationHigh = NULL,
@UpdateStatistics = 'ALL'
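The translation rule can be sketched as follows (Python used purely for illustration; the function and dictionary names are mine):

```python
# Sketch of the translation rule: a stored value equal to the registered
# "null keyword" becomes the literal NULL, an actual NULL (None here)
# becomes DEFAULT, and anything else is passed through quoted.

NULL_KEYWORDS = {
    ("IndexOptimize", "FragmentationMedium"): "NO_INDEX_MAINT",
    ("IndexOptimize", "FragmentationHigh"): "NO_INDEX_MAINT",
}

def render_param(operation, name, value):
    if value is None:
        return "DEFAULT"
    if NULL_KEYWORDS.get((operation, name)) == value:
        return "NULL"
    return f"'{value}'"

print(render_param("IndexOptimize", "FragmentationLow", None))                # DEFAULT
print(render_param("IndexOptimize", "FragmentationMedium", "NO_INDEX_MAINT"))  # NULL
print(render_param("IndexOptimize", "UpdateStatistics", "ALL"))               # 'ALL'
```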

 

Now on to the wrapper stored procedure that, for a specified operation (backup, integrity check or index optimize), a specified group name (or specific database name(s), or one of Ola’s database keywords) and a specified operation parameter tag, runs Ola’s actual stored procedure. Here are the procedures that do that:


/* Helper stored procedure that creates a string such as ",@Directory='c:\backups',@BackupType='FULL',@Verify='Y'" depending on your actual values in the corresponding Params table */

CREATE PROCEDURE [dbo].[GetOperationParams]
@Operation sysname,
@OperationParamTag nvarchar(255),
@OperationParams nvarchar(max) OUTPUT
AS
BEGIN

DECLARE @sqlstmt nvarchar(max)

SET @sqlstmt = STUFF(
    (
    SELECT '+'',@'+COLUMN_NAME+'=''+COALESCE('+
    CASE WHEN ParamName IS NULL THEN '' ELSE
    'CASE WHEN ['+COLUMN_NAME+']='''+NullKeyword+''' THEN ''NULL'' ELSE NULL END,'
    END
    +
    CASE WHEN CHARACTER_SET_NAME IS NOT NULL THEN
    'QUOTENAME(['+COLUMN_NAME+'],CHAR(39)+CHAR(39))'
    ELSE
    'CAST(['+COLUMN_NAME+'] AS NVARCHAR(2000))'
    END
    +',''DEFAULT'')'
    FROM INFORMATION_SCHEMA.COLUMNS
    LEFT OUTER JOIN dbo.OperationParamsNullKeyword
    ON OperationType = @Operation
    AND ParamName = COLUMN_NAME
    WHERE TABLE_SCHEMA = 'dbo'
    AND TABLE_NAME = @Operation + N'Params'
    AND ORDINAL_POSITION > 2
    FOR XML PATH('')
    ),
    1,1,'')

SET @sqlstmt = N'SELECT @OperationParams = ' + @sqlstmt + N' FROM dbo.' + QUOTENAME(@Operation + 'Params') + ' WHERE ParamTag = ' + QUOTENAME(@OperationParamTag,CHAR(39)+CHAR(39))

EXEC sp_executesql @sqlstmt,N'@OperationParams nvarchar(max) OUTPUT',@OperationParams = @OperationParams OUTPUT

END

GO
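As a rough illustration of what GetOperationParams produces, here is the equivalent string building in Python (the row is modelled as a dict, with None standing in for NULL; the function and column sets are mine):

```python
# Turn one row of a Params table into the ",@Name=value" fragment that
# gets appended to the EXEC statement. Character columns are quoted
# (with embedded single quotes doubled), NULLs become DEFAULT.

def operation_params(row, string_cols):
    parts = []
    for name, value in row.items():  # dicts preserve insertion order
        if value is None:
            rendered = "DEFAULT"
        elif name in string_cols:
            rendered = "'" + str(value).replace("'", "''") + "'"
        else:
            rendered = str(value)
        parts.append(f",@{name}={rendered}")
    return "".join(parts)

row = {"Directory": r"c:\dbadmin\backup", "BackupType": "FULL",
       "CleanupTime": None, "Compress": "Y"}
print(operation_params(row, string_cols={"Directory", "BackupType", "Compress"}))
# ,@Directory='c:\dbadmin\backup',@BackupType='FULL',@CleanupTime=DEFAULT,@Compress='Y'
```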

/* The wrapper stored procedure that you would use in your SQL Agent TSQL job step */

CREATE PROCEDURE [dbo].[RunOperation]
@Operation sysname,
@DatabaseGroup nvarchar(40) = NULL,
@DatabaseName sysname = NULL,
@OperationParamTag nvarchar(255)
AS
BEGIN

DECLARE @OperationParams nvarchar(max) = NULL

EXEC dbo.GetOperationParams @Operation = @Operation,@OperationParamTag = @OperationParamTag,@OperationParams = @OperationParams OUTPUT

DECLARE @Databases nvarchar(max) = NULL

IF @DatabaseName IS NOT NULL 
    SET @Databases = @DatabaseName
ELSE
    SET @Databases = STUFF(
        (
        SELECT ',' + dg.DatabaseName
        FROM dbo.DatabaseGroup dg
        WHERE dg.GroupName = @DatabaseGroup
        /* Check that the group operation is permitted */
        AND EXISTS (
            SELECT 1 FROM dbo.DatabaseGroupOperations dgo
            WHERE dgo.GroupName = @DatabaseGroup
            AND dgo.OperationType = @Operation
            AND dgo.ParamTag = @OperationParamTag
        )
        ORDER BY dg.DatabaseName
        FOR XML PATH('')
        ),
        1,1,'')

SET @Databases = '@Databases=''' + @Databases + ''''
DECLARE @sqlstmt nvarchar(max)

SET @sqlstmt = 'EXEC dbo.' + QUOTENAME(@Operation) + ' ' + @Databases + @OperationParams

PRINT 'Running: ' +  @sqlstmt

EXEC sp_executesql @sqlstmt

END

GO
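The database-list expansion in RunOperation can be sketched like this (the tables are modelled as in-memory structures and the names are mine; note the group is only expanded when the group/operation/tag combination is registered as permitted):

```python
# Sketch of RunOperation's @Databases expansion with the permission check.

DATABASE_GROUP = {"GroupUserDBs1": ["TestDB", "DBAdmin"]}
PERMITTED = {("GroupUserDBs1", "DatabaseBackup", "FullBackups")}

def databases_for(group, operation, tag):
    if (group, operation, tag) not in PERMITTED:
        return None  # operation not permitted for this group
    # Comma-separated, ordered by database name, as in the FOR XML PATH query
    return ",".join(sorted(DATABASE_GROUP.get(group, [])))

print(databases_for("GroupUserDBs1", "DatabaseBackup", "FullBackups"))
# DBAdmin,TestDB
print(databases_for("GroupUserDBs1", "IndexOptimize", "ReindexReorgs"))
# None
```

The `DBAdmin,TestDB` ordering matches the example output later in the post, because the T-SQL orders by DatabaseName too.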

 

An example wrapper call would be


/* Create a database group */
INSERT INTO dbo.DatabaseGroupParent
VALUES ('GroupUserDBs1')

INSERT INTO dbo.DatabaseGroup
VALUES
('GroupUserDBs1','TestDB'),
('GroupUserDBs1','DBAdmin')

/* Populate the permitted group operations */
INSERT INTO dbo.DatabaseGroupOperations VALUES ('GroupUserDBs1','DatabaseBackup','FullBackups')

/* Now the actual call to backup the database group */
EXEC dbo.RunOperation @Operation = 'DatabaseBackup',@DatabaseGroup='GroupUserDBs1',@OperationParamTag = 'FullBackups'

 

And the results would be

Running: EXEC dbo.[DatabaseBackup] @Databases='DBAdmin,TestDB',@Directory='c:\dbadmin\backup',@BackupType='FULL',@Verify=DEFAULT,@CleanupTime=DEFAULT,@CleanupMode=DEFAULT,@Compress='Y',@CopyOnly=DEFAULT,@ChangeBackupType=DEFAULT,@BackupSoftware=DEFAULT,@CheckSum=DEFAULT,@BlockSize=DEFAULT,@BufferCount=DEFAULT,@MaxTransferSize=DEFAULT,@NumberOfFiles=DEFAULT,@CompressionLevel=DEFAULT,@Description=DEFAULT,@Threads=DEFAULT,@Throttle=DEFAULT,@Encrypt=DEFAULT,@EncryptionAlgorithm=DEFAULT,@ServerCertificate=DEFAULT,@ServerAsymmetricKey=DEFAULT,@EncryptionKey=DEFAULT,@ReadWriteFileGroups=DEFAULT,@OverrideBackupPreference=DEFAULT,@NoRecovery=DEFAULT,@URL=DEFAULT,@Credential=DEFAULT,@MirrorDirectory=DEFAULT,@MirrorCleanupTime=DEFAULT,@MirrorCleanupMode=DEFAULT,@AvailabilityGroups=DEFAULT,@Updateability=DEFAULT,@LogToTable='Y',@Execute=DEFAULT
Date and time: 2016-11-12 21:32:44
Server: WIN2016
Version: 13.0.1601.5
Edition: Developer Edition (64-bit)
Procedure: [DBAdmin].[dbo].[DatabaseBackup]
Parameters: @Databases = 'DBAdmin,TestDB', @Directory = 'c:\dbadmin\backup', @BackupType = 'FULL', @Verify = 'N', @CleanupTime = NULL, @CleanupMode = 'AFTER_BACKUP', @Compress = 'Y', @CopyOnly = 'N', @ChangeBackupType = 'N', @BackupSoftware = NULL, @CheckSum = 'N', @BlockSize = NULL, @BufferCount = NULL, @MaxTransferSize = NULL, @NumberOfFiles = NULL, @CompressionLevel = NULL, @Description = NULL, @Threads = NULL, @Throttle = NULL, @Encrypt = 'N', @EncryptionAlgorithm = NULL, @ServerCertificate = NULL, @ServerAsymmetricKey = NULL, @EncryptionKey = NULL, @ReadWriteFileGroups = 'N', @OverrideBackupPreference = 'N', @NoRecovery = 'N', @URL = NULL, @Credential = NULL, @MirrorDirectory = NULL, @MirrorCleanupTime = NULL, @MirrorCleanupMode = 'AFTER_BACKUP', @AvailabilityGroups = NULL, @Updateability = 'ALL', @LogToTable = 'Y', @Execute = 'Y'
Source: https://ola.hallengren.com
 
Date and time: 2016-11-12 21:32:44
Database: [DBAdmin]
Status: ONLINE
Standby: No
Updateability: READ_WRITE
User access: MULTI_USER
Is accessible: Yes
Recovery model: SIMPLE
Encrypted: No
Differential base LSN: 34000000250700126
Last log backup LSN: N/A
 
Date and time: 2016-11-12 21:32:44
Command: DECLARE @ReturnCode int EXECUTE @ReturnCode = [master].dbo.xp_create_subdir N'c:\dbadmin\backup\WIN2016\DBAdmin\FULL' IF @ReturnCode <> 0 RAISERROR('Error creating directory.', 16, 1)
Outcome: Succeeded
Duration: 00:00:00
Date and time: 2016-11-12 21:32:44
 
Date and time: 2016-11-12 21:32:44
Command: BACKUP DATABASE [DBAdmin] TO DISK = N'c:\dbadmin\backup\WIN2016\DBAdmin\FULL\WIN2016_DBAdmin_FULL_20161112_213244.bak' WITH NO_CHECKSUM, COMPRESSION
Processed 488 pages for database 'DBAdmin', file 'DBAdmin' on file 1.
Processed 7 pages for database 'DBAdmin', file 'DBAdmin_log' on file 1.
BACKUP DATABASE successfully processed 495 pages in 0.026 seconds (148.512 MB/sec).
Outcome: Succeeded
Duration: 00:00:00
Date and time: 2016-11-12 21:32:44
 
Date and time: 2016-11-12 21:32:44
Database: [TestDB]
Status: ONLINE
Standby: No
Updateability: READ_WRITE
User access: MULTI_USER
Is accessible: Yes
Recovery model: SIMPLE
Encrypted: No
Differential base LSN: 34000011863500037
Last log backup LSN: N/A
 
Date and time: 2016-11-12 21:32:44
Command: DECLARE @ReturnCode int EXECUTE @ReturnCode = [master].dbo.xp_create_subdir N'c:\dbadmin\backup\WIN2016\TestDB\FULL' IF @ReturnCode <> 0 RAISERROR('Error creating directory.', 16, 1)
Outcome: Succeeded
Duration: 00:00:00
Date and time: 2016-11-12 21:32:44
 
Date and time: 2016-11-12 21:32:44
Command: BACKUP DATABASE [TestDB] TO DISK = N'c:\dbadmin\backup\WIN2016\TestDB\FULL\WIN2016_TestDB_FULL_20161112_213244.bak' WITH NO_CHECKSUM, COMPRESSION
Processed 4424 pages for database 'TestDB', file 'TestDB' on file 1.
Processed 2 pages for database 'TestDB', file 'TestDB_log' on file 1.
BACKUP DATABASE successfully processed 4426 pages in 0.168 seconds (205.778 MB/sec).
Outcome: Succeeded
Duration: 00:00:00
Date and time: 2016-11-12 21:32:44
 
Date and time: 2016-11-12 21:32:44

 

The table driven approach allows you to easily report out which options your solution is using as well as allowing you to change these options without having to make any changes to the SQL Agent jobs themselves. You can apply these options to user defined groups of databases.

 


Charting virtual file stats using R

This blog post shows how to retrieve data using R and chart the results in a number of ways. The example data consists of virtual file stats. The source is the virtual file stats DMV, so the ‘measures’ are ever-increasing values since the last restart of SQL Server. The deltas could be calculated using SQL windowing functions, but I show how to use the dplyr lag function instead to achieve the same result. The raw cumulative data is as follows:

[Screenshot virtualfilestats1: raw cumulative virtual file stats]

The TestDB consists of four data files and the virtual file stats DMV was sampled every 10 minutes. The above shows 30 rows out of a total of 2304 rows (from 1st March to 4th March inclusive). I’m only interested in the ‘number of bytes read’ measure. The same analysis could be done for any of the other measure columns.

Here’s the R code to read from the table and chart the results:


# Use the install.packages lines if this is the first use of the libraries
# install.packages("RODBC")
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("scales")

library(RODBC)
library(dplyr)
library(ggplot2)
library(scales)

# Connect to the database
dbhandle <- odbcDriverConnect('driver={SQL Server};server=MySQLServer;database=TestDB;trusted_connection=true')

# Cast the sampletime as separate date and time columns
res <- sqlQuery(dbhandle, "select CAST(sampletime AS date) AS DateValue,CAST(sampletime AS time) AS TimeValue,physical_name,num_of_bytes_read from virtualfilestats order by physical_name,sampletime")

# Load results into a dplyr data frame
df<-tbl_df(res)

# Cast and format the time column
df$TimeValue<-as.POSIXct(df$TimeValue,format = "%H:%M:%S")

# Group the data frame by file name so the lag function works per file name.
# Also, ensure the proper ordering within each group so the delta values are calculated correctly using lag
df<-df%>%group_by(physical_name)%>%arrange(DateValue,TimeValue)

# Use the lag function to calculate the deltas
df<-df%>%mutate(num_of_bytes_read=num_of_bytes_read - lag(num_of_bytes_read))

# One chart per day with colours representing the different file names
ggplot(df,aes(x=TimeValue,y=num_of_bytes_read,colour=physical_name)) + geom_line() + facet_grid(~DateValue) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + scale_x_datetime(breaks=date_breaks("1 hour"), labels=date_format("%H:%M")) + scale_y_continuous(labels = comma)

# One chart per day per file name. Independent y axis for each file name (scales = "free_y")
ggplot(df,aes(x=TimeValue,y=num_of_bytes_read)) + geom_line() + facet_grid(physical_name~DateValue,scales="free_y") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + scale_x_datetime(breaks=date_breaks("1 hour"), labels=date_format("%H:%M")) + scale_y_continuous(labels = comma)

# One chart per day per file name. Same scale for every y axis
ggplot(df,aes(x=TimeValue,y=num_of_bytes_read)) + geom_line() + facet_grid(physical_name~DateValue) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + scale_x_datetime(breaks=date_breaks("1 hour"), labels=date_format("%H:%M")) + scale_y_continuous(labels = comma)
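For readers more at home in Python, the per-file delta calculation has a direct pandas counterpart: group by file name, order within each group, and subtract the lagged value (the sample data here is invented for illustration):

```python
import pandas as pd

# Cumulative counters sampled per file, as in the DMV
df = pd.DataFrame({
    "physical_name": ["f1", "f1", "f1", "f2", "f2", "f2"],
    "sampletime": pd.to_datetime(["2016-03-01 00:00", "2016-03-01 00:10",
                                  "2016-03-01 00:20"] * 2),
    "num_of_bytes_read": [100, 150, 240, 10, 30, 35],
})

df = df.sort_values(["physical_name", "sampletime"])
# groupby().diff() is the pandas counterpart of dplyr's lag() subtraction:
# each value minus the previous value within the same file
df["delta_bytes_read"] = df.groupby("physical_name")["num_of_bytes_read"].diff()
print(df["delta_bytes_read"].tolist())
# [nan, 50.0, 90.0, nan, 20.0, 5.0]
```

As with the R version, the first sample per file has no predecessor, so its delta is missing.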

And here are the three charts that result from the ggplot commands:

The first one showing a chart for each day:

[Chart virtualfilestats2: one panel per day, colours per file name]

The next two are better. The only difference between them is that the first has independent y axes:

[Chart virtualfilestats3: one panel per day per file name, independent y axes]

[Chart virtualfilestats4: one panel per day per file name, common y axis]

You can see that most reads occur between 7am and 7pm, with more data being read from data files 1 and 2 (perhaps because of proportional fill with different file sizes).

R scripting is great for this type of DBA work. You have full control of the elements that appear on your visualisations using the ggplot2 library. Coupled with this the dplyr library contains a wealth of windowing functions to deal with the most common data trends seen by DBAs.

 


Programmatically generating charts

Excel is all well and good when you want to manually plot your data. You manually load your data, create a bar chart, format axes, change the alignment for labels, add a title and save the result as a file for your team to review. But in an age where everyone is using tools such as PowerShell or Python for their administration tasks, it makes sense to use a scripting tool to generate charts as well. After all, as data professionals we all know how to programmatically read from a CSV file or from a SQL Server table. This blog post shows just how easily and elegantly you can generate a publication-quality chart using the ggplot2 library in R.

R is an established data analysis tool. There is a wealth of libraries for all sorts of statistical analysis. Many primitives that exist in SQL Server have counterparts in R; for example, a table can be represented as a data frame. For data frame operations (i.e. selecting, filtering, grouping, ordering, windowing functions), the dplyr library implements most of the SQL-style operations. This blog post does not concern dplyr, but one of its authors (Hadley Wickham) has also authored the ggplot2 library, which implements “the grammar of graphics”. You programmatically generate your chart using the functions within this library. An example making use of this library now follows.

Suppose you have the following CSV file containing summary results for eight different SQL Server configurations. The raw data for each configuration consists of average transactions per second, end-to-end total batch time, disk service time for the transaction log volume, and average seconds per write for the transaction log volume:

config,avgtps,totalbatchtime,disktime,avgdisk
instance2,375,7.33,1.27,0.414
instance3highcpuplan,794,4.08,1.00,0.333
instance4,391,8.01,1.28,0.426
instance5,296,10.6,1.76,0.585
instance6,351,8.95,1.99,0.658
instance2highcpuplan,822,4.04,0.9896,0.325
instance1san,589,5.42,1.10,0.365
instance1ssd,939,3.49,0.10,0.038

You want to visualise the effect of SSD on reducing total batch time compared to the effect of the high performance CPU power plan.

Assuming you have R and RStudio installed on your workstation this is all the code you need to generate a publication quality chart for the results:

# Install the packages if this is the first time they are being used by the user

# install.packages('dplyr')
# install.packages('ggplot2')
# install.packages('reshape2')

# Use the libraries
library(dplyr)
library(ggplot2)
library(reshape2)

# Read the raw data into a native R data frame
# You could just as easily read the raw data from SQL Server using the RODBC library (see a future blog post)
rawdata<-read.csv('c:\\temp\\data.csv')

# Load the raw data into a dplyr data frame and display the "wide" format results
df<-tbl_df(rawdata)
df

# The df data frame holds its contents in what's known as "wide" format, that is, with columns for the four measures.
# Charting works most naturally with "narrow" format data, so we unpivot the results using the reshape2 melt function (author Hadley Wickham again)
plotdata <- melt(df, id = c("config"), measure = c("avgtps", "totalbatchtime","disktime","avgdisk"))

# Display the "narrow" format results. This is the format that ggplot will operate on
plotdata

# Plot the results
ggplot(plotdata, aes(x = config, y = value)) +
geom_bar(stat="identity") +
facet_wrap(~variable,scales="free_y",ncol=1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_text(aes(label = value, y = value, vjust=-1), size = 3)

The output is as follows:

df in “wide” format:

Source: local data frame [8 x 5]

                config avgtps totalbatchtime disktime avgdisk
1            instance2    375           7.33   1.2700   0.414
2 instance3highcpuplan    794           4.08   1.0000   0.333
3            instance4    391           8.01   1.2800   0.426
4            instance5    296          10.60   1.7600   0.585
5            instance6    351           8.95   1.9900   0.658
6 instance2highcpuplan    822           4.04   0.9896   0.325
7         instance1san    589           5.42   1.1000   0.365
8         instance1ssd    939           3.49   0.1000   0.038

plotdata in “narrow” format:

                 config       variable    value
1             instance2         avgtps 375.0000
2  instance3highcpuplan         avgtps 794.0000
3             instance4         avgtps 391.0000
4             instance5         avgtps 296.0000
5             instance6         avgtps 351.0000
6  instance2highcpuplan         avgtps 822.0000
7          instance1san         avgtps 589.0000
8          instance1ssd         avgtps 939.0000
9             instance2 totalbatchtime   7.3300
10 instance3highcpuplan totalbatchtime   4.0800
11            instance4 totalbatchtime   8.0100
12            instance5 totalbatchtime  10.6000
13            instance6 totalbatchtime   8.9500
14 instance2highcpuplan totalbatchtime   4.0400
15         instance1san totalbatchtime   5.4200
16         instance1ssd totalbatchtime   3.4900
17            instance2       disktime   1.2700
18 instance3highcpuplan       disktime   1.0000
19            instance4       disktime   1.2800
20            instance5       disktime   1.7600
21            instance6       disktime   1.9900
22 instance2highcpuplan       disktime   0.9896
23         instance1san       disktime   1.1000
24         instance1ssd       disktime   0.1000
25            instance2        avgdisk   0.4140
26 instance3highcpuplan        avgdisk   0.3330
27            instance4        avgdisk   0.4260
28            instance5        avgdisk   0.5850
29            instance6        avgdisk   0.6580
30 instance2highcpuplan        avgdisk   0.3250
31         instance1san        avgdisk   0.3650
32         instance1ssd        avgdisk   0.0380
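As an aside, the reshape2 melt step has a direct counterpart in pandas, sketched here with the first rows of the CSV (Python purely for illustration):

```python
import pandas as pd

# A small "wide" frame: one row per configuration, one column per measure
wide = pd.DataFrame({
    "config": ["instance2", "instance1ssd"],
    "avgtps": [375, 939],
    "totalbatchtime": [7.33, 3.49],
})

# Unpivot to "narrow" format: one (config, variable, value) row per measure,
# matching reshape2's melt(df, id = ..., measure = ...)
narrow = wide.melt(id_vars=["config"], var_name="variable", value_name="value")
print(narrow)
```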

 

The bar charts make use of the facet command to show each measure as a separate chart, with “free_y” scales giving each an independent y axis. The x axis labels are rotated 90 degrees and the measure value is placed above each bar.

 

[Chart: faceted bar charts, one panel per measure]

 

Visualising the results, you can see that although instance1 with SSD had dramatically reduced transaction log latency compared to instance1 on SAN, in terms of reducing total batch time the bigger “bang for the buck” was seen for instance2 going from a balanced CPU power plan to a high performance CPU power plan: total batch time went down by almost 3.3 seconds and average transactions per second more than doubled. This is interesting but is not the point of this blog post.

Using the “grammar of graphics” approach implemented in the ggplot2 library you have complete programmatic control of the elements you wish to present and the style of each element. Better still, you can send your end users not just the image but also the script used to generate it. That script is source code, so it can be version controlled with a maintained history. You can’t do any of this with an Excel-based solution.

You don’t need to be an R coding expert to use the scripting language; the learning curve is far lower than that for PowerShell. And since R is a data language whose operations tend to be vector and matrix operations, it’s a natural fit for professionals who are already familiar with set-based SQL programming.

With Microsoft’s acquisition of Revolution Analytics there’s a high chance that R will become more prominent in the toolset that Microsoft provide to SQL Server professionals.

 


Column store storage for perfmon data

Existing solutions for storing perfmon data in a SQL Server database typically use traditional b-tree type tables to house the data (for example, the PAL tool and RELOG.EXE when the destination is a SQL table). This article shows a very simple schema for storing perfmon data in a column store table.

Because perfmon data mostly consists of well defined repeating values it is an ideal candidate for compressed storage. Depending on the sampling period the amount of data being collected is likely to be very high so a high capacity data storage structure that can be queried very quickly is also desirable. Lastly, because the data is not modified after it is collected it appears as read-only data within the database. All these attributes are a perfect match for column store storage.

Here’s one possible schema that can be used for storing perfmon data in a column store table:

CREATE TABLE [dbo].[factCounterData](
[ServerID] [int] NOT NULL,
[DateID] [int] NOT NULL,
[TimeID] [int] NOT NULL,
[CounterID] [int] NOT NULL,
[CounterValue] [float] NOT NULL
)
GO

CREATE CLUSTERED COLUMNSTORE INDEX [cci_factCounterData] ON [dbo].[factCounterData] WITH (DROP_EXISTING = OFF)
GO

There are various ways in which you may be collecting perfmon data. This article won’t discuss those options but will assume the data is present in a series of perfmon BLG files. These files can be imported into a database in a number of ways but the simplest method is to use the RELOG command. You create a system ODBC data source name (DSN) and specify this when running the command. For example,

RELOG.EXE \\MyServerName\d$\perflogs\admin\perfdata*.blg -f SQL -o SQL:PerfmonDb!Win2012r2

For this example, the machine that runs RELOG.EXE has an ODBC system DSN named PerfmonDb that specifies the target database.

On first execution RELOG.EXE creates these tables in the target database:

CREATE TABLE [dbo].[CounterData](
[GUID] [uniqueidentifier] NOT NULL,
[CounterID] [int] NOT NULL,
[RecordIndex] [int] NOT NULL,
[CounterDateTime] [char](24) NOT NULL,
[CounterValue] [float] NOT NULL,
[FirstValueA] [int] NULL,
[FirstValueB] [int] NULL,
[SecondValueA] [int] NULL,
[SecondValueB] [int] NULL,
[MultiCount] [int] NULL,
PRIMARY KEY CLUSTERED
(
[GUID] ASC,
[CounterID] ASC,
[RecordIndex] ASC
)
)
GO

CREATE TABLE [dbo].[CounterDetails](
[CounterID] [int] IDENTITY(1,1) NOT NULL,
[MachineName] [varchar](1024) NOT NULL,
[ObjectName] [varchar](1024) NOT NULL,
[CounterName] [varchar](1024) NOT NULL,
[CounterType] [int] NOT NULL,
[DefaultScale] [int] NOT NULL,
[InstanceName] [varchar](1024) NULL,
[InstanceIndex] [int] NULL,
[ParentName] [varchar](1024) NULL,
[ParentObjectID] [int] NULL,
[TimeBaseA] [int] NULL,
[TimeBaseB] [int] NULL,
PRIMARY KEY CLUSTERED
(
[CounterID] ASC
)
)
GO

CREATE TABLE [dbo].[DisplayToID](
[GUID] [uniqueidentifier] NOT NULL,
[RunID] [int] NULL,
[DisplayString] [varchar](1024) NOT NULL,
[LogStartTime] [char](24) NULL,
[LogStopTime] [char](24) NULL,
[NumberOfRecords] [int] NULL,
[MinutesToUTC] [int] NULL,
[TimeZoneName] [char](32) NULL,
PRIMARY KEY CLUSTERED
(
[GUID] ASC
),
UNIQUE NONCLUSTERED
(
[DisplayString] ASC
)
)
GO

Subsequent executions of the RELOG.EXE command append to the data that exists in the dbo.CounterData table. Sample data from each of the tables is as follows:

DisplayToID:

[image: colstoreperfmon1]

CounterDetails:

[image: colstoreperfmon2]

CounterData:

[image: colstoreperfmon3]

Now if you forgo the column store table and simply query the three tables that RELOG.EXE creates you would have a short term workable solution for your perfmon reporting needs. However, as the volume of data in CounterData grows you will find that your storage requirements grow and your reporting queries take longer to run.

In data warehousing terms CounterData is the fact table. This will become very large. The two other tables are the dimension tables. These will never become really large. The column store solution will focus on the CounterData table.

The columns of interest in CounterData are GUID (the link to the DisplayToID dimension table, which holds the DisplayString name – the server name), CounterID (the link to the CounterDetails dimension table, which holds the counter names), CounterDateTime and CounterValue. The column store table adopts best practice and stores the dimension links as integer columns. You can see from the structure of the factCounterData table at the start of this article that the GUID is not stored. Instead an integer ServerID is stored. Also, the CounterDateTime value is split into DateID and TimeID. Here’s the code that populates the column store table:

 

-- Dimension table to hold the GUID to ServerID mapping
CREATE TABLE [dbo].[dimServer](
[ServerID] [int] IDENTITY(1,1) NOT NULL,
[ServerName] [varchar](1024) NOT NULL,
[GUID] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_dimServer] PRIMARY KEY CLUSTERED
(
[ServerID] ASC
)
)
GO

-- Routine to populate dimServer. Insert row if it's the first time we're seeing the DisplayString
CREATE PROCEDURE [dbo].[procInsertDimServer] AS
BEGIN
SET NOCOUNT ON

INSERT INTO dbo.dimServer (ServerName,GUID)
SELECT DisplayString,GUID
FROM dbo.DisplayToID d
WHERE NOT EXISTS (SELECT * FROM dbo.dimServer s WHERE s.ServerName = d.DisplayString)
END
GO

CREATE PROCEDURE [dbo].[procInsertFactCounterData]
AS
BEGIN
SET NOCOUNT ON

-- Insert the server row if we've not inserted it during previous executions
EXEC dbo.procInsertDimServer

-- Now insert the fact data. Split the datetime value into separate date and time columns
INSERT INTO dbo.factCounterData (
                ServerID,
                DateID,
                TimeID,
                CounterID,
                CounterValue
)
SELECT
                s.ServerID,
                CAST(REPLACE(LEFT(c.CounterDateTime,10),'-','') AS int),
                DATEDIFF(second,'00:00:00',SUBSTRING(c.CounterDateTime,12,8)),
                c.CounterID,
                c.CounterValue
FROM dbo.CounterData c
INNER JOIN dbo.dimServer s
ON c.GUID = s.GUID
END
GO

-- CounterData is now just a staging table so can be cleared down before running RELOG.EXE
CREATE PROCEDURE [dbo].[procClearDownCounterData]
AS
BEGIN
SET NOCOUNT ON
TRUNCATE TABLE dbo.CounterData
END
GO
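The string manipulation in procInsertFactCounterData relies on CounterDateTime being a fixed-format char(24) value such as '2014-11-30 16:46:41.123'. The same encoding sketched in Python (for illustration only, mirroring the LEFT/REPLACE and DATEDIFF expressions in the T-SQL above):

```python
def split_counter_datetime(counter_datetime: str):
    """Split a perfmon char(24) timestamp into (DateID, TimeID).

    DateID is yyyymmdd as an integer; TimeID is seconds since midnight.
    """
    date_part = counter_datetime[:10]        # '2014-11-30'
    time_part = counter_datetime[11:19]      # '16:46:41'
    date_id = int(date_part.replace('-', ''))
    h, m, s = (int(x) for x in time_part.split(':'))
    time_id = h * 3600 + m * 60 + s
    return date_id, time_id

print(split_counter_datetime('2014-11-30 16:46:41.123'))  # (20141130, 60401)
```

Splitting the timestamp this way keeps every fact column as a small fixed-width value, which suits both the columnstore compression and range filtering later on.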

With this code in place the typical execution steps are:

  1. Execute dbo.procClearDownCounterData
  2. Run the RELOG.EXE command to populate CounterData, CounterDetails and DisplayToID
  3. Execute dbo.procInsertFactCounterData

In terms of storage costs, for one server capturing one hundred different performance counters every two seconds I found that over 650 million rows in factCounterData occupied only two gigabytes of database space. The compression achieved depends on the number of repeating values but you should find impressive storage reductions over conventional b-tree storage.
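To put that in context: each fact row is four 4-byte int columns plus an 8-byte float, so before any page or row overhead 650 million rows would need around 14–15 GB in uncompressed storage. A quick back-of-the-envelope check (the figures are from my test; the uncompressed estimate ignores overheads, which would only increase it):

```python
rows = 650_000_000
bytes_per_row = 4 * 4 + 8           # four int columns + one float column
uncompressed_gb = rows * bytes_per_row / 1024**3
compressed_gb = 2                   # observed columnstore size

print(round(uncompressed_gb, 1))                  # ~14.5 GB of raw column data
print(round(uncompressed_gb / compressed_gb, 1))  # ~7.3x reduction
```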

In terms of reporting, here’s a parameterised stored procedure that can be used to query a server for a specified counter. Start and end time ranges can be supplied when calling the procedure:


USE [perfmondb]
GO

/****** Object:  StoredProcedure [dbo].[procGetCounterValues]    Script Date: 13/01/2015 23:23:16 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[procGetCounterValues]
@ServerName varchar(1024),
@ObjectName varchar(1024) = NULL,
@CounterName varchar(1024) = NULL,
@InstanceName varchar(1024) = NULL,
@Start datetime = NULL,
@End datetime = NULL
AS
BEGIN

SET NOCOUNT ON

DECLARE @StartDateID int = CONVERT(char(8),@Start,112)
DECLARE @StartTimeID int = DATEDIFF(second,'00:00:00',CAST(@Start AS time))
DECLARE @EndDateID   int = CONVERT(char(8),@End,112)
DECLARE @EndTimeID   int = DATEDIFF(second,'00:00:00',CAST(@End AS time))

SELECT
s.ServerName,
DATEADD(second,f.TimeID,CAST(f.DateID AS char(8))) AS CounterTime,
c.ObjectName,
c.CounterName,
c.InstanceName,
f.CounterValue
FROM dbo.factCounterData f
INNER JOIN dbo.dimServer s
ON s.ServerID = f.ServerID
INNER JOIN dbo.CounterDetails c
ON c.CounterID = f.CounterID
WHERE s.ServerName = @ServerName
AND (@ObjectName IS NULL OR c.ObjectName = @ObjectName)
AND (@CounterName IS NULL OR c.CounterName = @CounterName)
AND (@InstanceName IS NULL OR c.InstanceName = @InstanceName)
AND (
@StartDateID IS NULL
OR f.DateID > @StartDateID
OR (f.DateID = @StartDateID AND f.TimeID >= @StartTimeID)
)
AND (
@EndDateID IS NULL
OR f.DateID < @EndDateID
OR (f.DateID = @EndDateID AND f.TimeID <= @EndTimeID)
)
ORDER BY CounterTime
END

GO
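The WHERE clause implements the time range as a lexicographic comparison on the (DateID, TimeID) pair – comparing DateID alone would be wrong on the boundary days. The OR-expanded predicate is equivalent to tuple comparison, sketched here in Python for illustration:

```python
def in_range(date_id, time_id, start=None, end=None):
    """Mirror the procedure's range predicate.

    start and end are (DateID, TimeID) tuples, or None for an open bound.
    """
    if start is not None and (date_id, time_id) < start:
        return False
    if end is not None and (date_id, time_id) > end:
        return False
    return True

print(in_range(20141130, 60401, start=(20141130, 60401)))  # True: start bound is inclusive
print(in_range(20141129, 86399, start=(20141130, 0)))      # False: previous day excluded
```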

An example call:

EXECUTE [dbo].[procGetCounterValues]
@ServerName = 'Win2012r2',
@CounterName = '% Processor Time',
@InstanceName = '_Total',
@Start = '20141130 16:46:41',
@End = '20141130 16:46:59'

Output is this:

[image: colstoreperfmon4]

So that’s ten rows out of a total of over one million rows for my data set. And the plan showed batch mode execution for the far-right column store index scan:

[image: colstoreperfmon5]

 

Even querying the table when it held over 650 million rows I found my queries returned in less than three seconds.

So both storage space and query response time are optimised when using column store storage for performance monitor data.


Failed full backup can impact next differential backup

For any of your very large databases you are probably taking once a week full backups followed by daily differential backups. We know that differential change map (DCM) pages hold references to the extents that have changed since the last full backup and it is these that are examined when a differential backup is run. The DCM pages allow the changed extents to be quickly identified.
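Conceptually the DCM is a bitmap with one bit per extent: a differential backup reads only the extents whose bit is set, and a successful full backup clears the map. A toy Python model of that idea (this illustrates the concept only, not SQL Server’s internal implementation):

```python
class DifferentialMap:
    """Toy model of a differential change map: one flag per extent."""

    def __init__(self, extent_count):
        self.changed = [False] * extent_count

    def mark_changed(self, extent):
        # Set when any page in the extent is modified
        self.changed[extent] = True

    def extents_for_differential(self):
        # A differential backup only needs the flagged extents
        return [i for i, c in enumerate(self.changed) if c]

    def clear_after_full_backup(self):
        # A *successful* full backup resets the map
        self.changed = [False] * len(self.changed)

dcm = DifferentialMap(extent_count=1000)
for extent in (3, 42, 900):
    dcm.mark_changed(extent)

print(dcm.extents_for_differential())  # [3, 42, 900] -- only 3 of 1000 extents read
dcm.clear_after_full_backup()
print(dcm.extents_for_differential())  # [] -- next differential starts fresh
```

The trouble described below arises when the clearing happens during a full backup that then fails, leaving the map in a state that subsequent differential backups cannot trust.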

Using your maintenance routines you are probably finding that your full database backup takes a couple of hours to complete and backs up a terabyte worth of data. Your daily differential backups complete in minutes and back up a hundred gigabytes worth of data by the time the next full backup is due. This is all good and your maintenance fits into the windows you have for administration.

But what if you run the once a week full backup on a Sunday, it fails, and you only notice on the Monday morning when your maintenance window is closed? It’s probably the case that in terms of the backup chain the full backup failure doesn’t matter. You keep on taking differential backups, with these being based on the previous week’s full database backup, so you have no problem in terms of recovery. But what you’ll be surprised to find is that the first and subsequent differential backups since the failed full backup take substantially longer to complete, particularly when compared to the previous week’s elapsed times. This will be the case even though the volume of data changes hitting your database hasn’t changed since the previous week. The differential backup duration may increase several fold compared to the previous week.

The reason for the increase in the differential backup time is that the backup process now examines every single data page for changes. It won’t be until the next full backup successfully completes that the optimised mode of the differential backup will run (the mode that uses the DCM pages to identify changed extents). Here is a repro script illustrating the issue:


-- Create a test table in one of your testing databases
-- Created so that one row occupies one page so that we can easily make sense of the page numbers that are reported
-- in the backup output

CREATE TABLE testdb.dbo.mytable (
id int NOT NULL,
val char(7000) NOT NULL
)
GO

-- Populate the table with 50,000 rows (50,000 data pages)
SET NOCOUNT ON
GO

INSERT INTO testdb.dbo.mytable
VALUES (1,'hello')
GO 50000

-- Take a full backup
BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdb.bak' WITH INIT,NO_COMPRESSION
GO

/*
Processed 50656 pages for database 'testdb', file 'testdb' on file 1.
Processed 1 pages for database 'testdb', file 'testdb_log' on file 1.
BACKUP DATABASE successfully processed 50657 pages in 4.865 seconds (81.346 MB/sec).
*/

-- You can see that roughly 50,000 pages were backed up

-- Now that previous backup was successful. Let's see how quick the next differential backup is.
-- It should be fast as there have been no data changes since the last full backup.
BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdbdiff.bak' WITH INIT,DIFFERENTIAL,NO_COMPRESSION
GO

/*
Processed 56 pages for database 'testdb', file 'testdb' on file 1.
Processed 1 pages for database 'testdb', file 'testdb_log' on file 1.
BACKUP DATABASE WITH DIFFERENTIAL successfully processed 57 pages in 0.015 seconds (29.687 MB/sec).
*/

-- And, yes, it is very quick with only 56 database pages being examined.

-- Now run the full backup again but interrupt it before completion to simulate an error you may get in your production system
BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdb.bak' WITH INIT,NO_COMPRESSION
GO

/*
Msg 3204, Level 16, State 1, Line 69
The backup or restore was aborted.
Msg 3013, Level 16, State 1, Line 69
BACKUP DATABASE is terminating abnormally.
Query was cancelled by user.
*/

-- Now run the differential backup again. You might expect it to still be quick.
BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdbdiff.bak' WITH INIT,DIFFERENTIAL,NO_COMPRESSION
GO

/*
Processed 50656 pages for database 'testdb', file 'testdb' on file 1.
Processed 1 pages for database 'testdb', file 'testdb_log' on file 1.
BACKUP DATABASE WITH DIFFERENTIAL successfully processed 50657 pages in 1.890 seconds (209.392 MB/sec).
*/

50,000+ database pages were examined for the differential backup that followed the failed full backup. That equates to every page in the database. For terabyte sized databases that will result in a very high elapsed time. It won’t be until the next successful full backup that the differential backup will return to its ‘optimised’ behaviour:

BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdb.bak' WITH INIT,NO_COMPRESSION
GO

/*
Processed 50656 pages for database 'testdb', file 'testdb' on file 1.
Processed 2 pages for database 'testdb', file 'testdb_log' on file 1.
BACKUP DATABASE successfully processed 50658 pages in 3.784 seconds (104.587 MB/sec).
*/

BACKUP DATABASE testdb TO DISK='C:\SQL2014\Backup\testdbdiff.bak' WITH INIT,DIFFERENTIAL,NO_COMPRESSION
GO

/*
Processed 40 pages for database 'testdb', file 'testdb' on file 1.
Processed 1 pages for database 'testdb', file 'testdb_log' on file 1.
BACKUP DATABASE WITH DIFFERENTIAL successfully processed 41 pages in 0.016 seconds (20.019 MB/sec).
*/

These observations would suggest that the DCM page content is examined at the start of the full backup and cleared as the backup progresses. If the backup fails it seems that the DCM page content is not rolled back to its pre-full-backup state. Subsequent differential backups somehow detect that the previous full backup failed (some sort of flag in the boot page or the DCM pages, perhaps) and run in a mode where they examine all database pages. These cases will show up as high reads in your virtual file stats, your perfmon disk reads and your SAN disk reads.

So for the differential backups of your big databases look out for these being impacted when your full database backups fail.


Reading extended event files using client side tools only

For any serious tracing work you’ll be using extended events and sending the output to XEL target files. Various options exist when it comes to reading these files. You can open the files using the file open function in Management Studio. Or you can use the sys.fn_xe_file_target_read_file function to have your SQL Server database engine read the files.

If you want to read the XEL files programmatically and don’t want to be constrained by having a database connection you can write a .Net program that uses the extended event reader functionality in Microsoft.SqlServer.XEvent.Linq.dll. For example, this post describes an approach using C#:

http://sqlblog.com/blogs/extended_events/archive/2011/07/20/introducing-the-extended-events-reader.aspx

If you prefer to use Powershell here’s one solution:

# If doing this on a client that just has the client tools and no database engine then you have to use 32 bit Powershell as Microsoft
# have not provided a 64 bit assembly for the client tools distribution.
# If you are running this on a server that has the database engine then you don't need to use 32 bit Powershell as the 64 bit assembly
# is present in the C:\Program Files\Microsoft SQL Server\120\Shared directory

# For SQL 2014 you have to add the reference to Microsoft.SqlServer.XE.Core.dll. You don't need this for SQL 2012

Add-Type -Path 'C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\ManagementStudio\Extensions\Application\Microsoft.SqlServer.XE.Core.dll'
Add-Type -Path 'C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\ManagementStudio\Extensions\Application\Microsoft.SqlServer.XEvent.Linq.dll'

$events = New-Object Microsoft.SqlServer.XEvent.Linq.QueryableXEventData("c:\temp\test.xel")

$events | Foreach-Object { $_.Actions | Where-Object { $_.Name -eq 'client_hostname' } } | Group-Object Value

This can be used if, for example, you’re tracing the sql_batch_completed event and have added the client_hostname action. The script outputs the counts for each host name in the XEL file.
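The Group-Object step at the end of the pipeline is simply a count of distinct action values. The same aggregation sketched in Python, with made-up hostnames standing in for the values read from the XEL file:

```python
from collections import Counter

# Hypothetical client_hostname action values pulled from sql_batch_completed events
hostnames = ['APP01', 'APP01', 'WEB02', 'APP01', 'WEB02']

counts = Counter(hostnames)
print(counts.most_common())  # [('APP01', 3), ('WEB02', 2)]
```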

As can be seen in the script comments it’s unfortunate that for the client side only solution we have no choice but to use the 32 bit assemblies. This is still the case for the SQL 2014 client tools.

 
