Archive for April, 2014

Excessive logging pre-SQL 2012 when changing variable length columns from NULL to NOT NULL

In pre-SQL Server 2012 versions you may encounter excessive transaction logging when you use the ALTER TABLE ALTER COLUMN command to change NULL to NOT NULL. Depending on the record size the amount of transaction log space used could be over three times the size of the table you are altering. This post shows that this is specifically an issue when changing variable length columns from NULL to NOT NULL. The issue does not occur for the same change applied to fixed length columns.

The basis of this post is this StackExchange thread: http://dba.stackexchange.com/questions/29522/why-does-alter-column-to-not-null-cause-massive-log-file-growth

The thread pretty much explains the problem, but the point about fixed versus variable length columns is not entirely clear. Here’s a test script that demonstrates the differences:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col varchar(3) NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col varchar(3) NOT NULL;

SELECT * FROM fn_dblog(null,null);

You can see the ten LOP_MODIFY_ROW entries where the AllocUnitName is dbo.testalternotnull.

[Image: altertonotnull1 (fn_dblog output for the VARCHAR NULL to NOT NULL change)]

Also note the log reserve figures.
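To quantify this rather than scanning the raw fn_dblog output, you can sum the log record length and log reserve figures for the allocation unit being altered. A minimal sketch, using the column names returned by fn_dblog:

SELECT [Operation],
COUNT(*) AS log_record_count,
SUM([Log Record Length]) AS log_bytes_used,
SUM([Log Reserve]) AS log_bytes_reserved
FROM fn_dblog(NULL, NULL)
WHERE AllocUnitName = 'dbo.testalternotnull'
GROUP BY [Operation];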

Now let’s check the case where we are changing from NOT NULL to NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col varchar(3) NOT NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col varchar(3) NULL;

SELECT * FROM fn_dblog(null,null);

[Image: altertonotnull2 (fn_dblog output for the VARCHAR NOT NULL to NULL change)]

No reference to the table name here, so this is just a metadata change.

Now let’s check the results for a fixed length column changing NULL to NOT NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col char(3) NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col char(3) NOT NULL;

SELECT * FROM fn_dblog(null,null);

[Image: altertonotnull3 (fn_dblog output for the CHAR NULL to NOT NULL change)]

No references to LOP_MODIFY_ROW where the AllocUnitName is dbo.testalternotnull, so for fixed length columns this is a metadata only change.

I won’t show the output for changing a CHAR(3) NOT NULL to NULL as the result is the same as that for VARCHAR(3) NOT NULL to NULL. That is, it is a metadata only change.

What about BIT columns? Well, given that they are represented in the record layout as a byte-packed structure, they can be considered a variable length type. Let’s check this by showing the output for changing a BIT NULL to a BIT NOT NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col BIT NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,0)
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col BIT NOT NULL;

SELECT * FROM fn_dblog(null,null);

[Image: altertonotnull4 (fn_dblog output for the BIT NULL to NOT NULL change)]

So, yes, you can consider the BIT type to be a variable length type, and changing it from NULL to NOT NULL can result in high transaction log use.
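If you want to gauge the log impact of one of these ALTER statements against your own tables before running it for real, one approach (a sketch, best run on a test copy since the open transaction holds a schema modification lock) is to wrap the change in an explicit transaction, read the log usage for the current session and then roll back:

BEGIN TRANSACTION;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col varchar(3) NOT NULL;

-- log records and bytes generated by the current session's transaction
SELECT dt.database_transaction_log_record_count,
dt.database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions dt
JOIN sys.dm_tran_session_transactions st ON st.transaction_id = dt.transaction_id
WHERE st.session_id = @@SPID
AND dt.database_id = DB_ID();

ROLLBACK TRANSACTION;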

In summary, this post shows examples of high, and what you may consider excessive, transaction log use when changing variable length columns from NULL to NOT NULL. The StackExchange thread referred to at the start of this post shows that the resulting size of the transaction log may be many times the size of the table being altered. Fortunately, for SQL Server 2012 and above most column changes appear to be metadata only changes.


Delayed durability: performance boost for certain workloads

If some of your workloads are able to tolerate a certain amount of data loss and they are already bottlenecked on the write performance of the transaction log, then SQL Server 2014’s delayed durability option will probably give you an immediate performance boost. One use case that comes to mind is staging databases where ETL work is being performed. Some of the data feeds may be implemented as single row inserts as opposed to batch inserts. Within the application environment you may have no options available to convert to batch inserts (ie multiple inserts in one transaction). This post will show the performance improvements you could see by enabling delayed durability.

Single row inserts going into this table:


CREATE TABLE dbo.TestingData
(
id INT NOT NULL IDENTITY,
col1 INT NOT NULL DEFAULT 2000,
col2 VARCHAR(50) NOT NULL DEFAULT 'wxyz'
)

Test script inserting 10000 rows:


SET NOCOUNT ON
DECLARE @i INT = 1
WHILE @i <= 10000
BEGIN
INSERT INTO dbo.TestingData( col1, col2 ) VALUES (DEFAULT, DEFAULT)
SET @i = @i + 1
END
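For comparison, the batched alternative mentioned in the introduction (which wasn’t an option for the workload in question) would simply wrap the same loop in a single transaction, something like this:

SET NOCOUNT ON
DECLARE @i INT = 1
BEGIN TRANSACTION
WHILE @i <= 10000
BEGIN
INSERT INTO dbo.TestingData( col1, col2 ) VALUES (DEFAULT, DEFAULT)
SET @i = @i + 1
END
COMMIT TRANSACTION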

Toggle the delayed durability option using this:


ALTER DATABASE [DelayedDurabilityTest] SET DELAYED_DURABILITY = <FORCED or DISABLED> WITH NO_WAIT
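For reference, the before and after sampling of sys.dm_io_virtual_file_stats used to capture the results could be done along these lines (a sketch; the snapshot table name is just for illustration):

-- snapshot before the test (SELECT INTO creates the snapshot table on the first run)
SELECT GETDATE() AS sampletime, file_id, num_of_reads, num_of_bytes_read, num_of_writes, num_of_bytes_written
INTO dbo.filestats_snapshot
FROM sys.dm_io_virtual_file_stats(DB_ID('DelayedDurabilityTest'), NULL);

-- run the insert loop, then take the second snapshot
INSERT INTO dbo.filestats_snapshot
SELECT GETDATE(), file_id, num_of_reads, num_of_bytes_read, num_of_writes, num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(DB_ID('DelayedDurabilityTest'), NULL);

The figures reported below are then the differences between the two snapshots for each file.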

Results inserting 10000 rows (captured using before and after sampling of sys.dm_io_virtual_file_stats):

             sampletime    file_id    reads    readbytes    bytesperread    writes    writebytes    bytesperwrite
2014-04-21 10:22:02.967    1          0        0            0               0         0             0
2014-04-21 10:22:02.967    2          0        0            0               10000     5154816       515
2014-04-21 11:00:01.947    1          0        0            0               0         0             0
2014-04-21 11:00:01.947    2          0        0            0               1733      4046336       2334

The test at 10:22 was conducted with delayed durability disabled (the default). File id 1 is the data file, file id 2 is the log file. The test at 11:00 was conducted with delayed durability set to forced. You can see the reduced number of writes to the transaction log and the higher figure for bytes per write. The Sysinternals Process Monitor utility confirms these details. For the 10:22 test we see this:

[Image: delayeddurability1 (Process Monitor capture of the 10:22 test showing 512 byte log writes)]

The Sequence column starts at 0 so this is showing us 10000 writes each being 512 bytes (matches the results from sys.dm_io_virtual_file_stats). For the 11:00 test we see this:

[Image: delayeddurability2 (Process Monitor capture of the 11:00 test showing fewer, larger log writes)]

The Sequence column starts at 10000 and ends at 11732, so that’s 1733 write IOs, matching the figure from the DMV. The Detail column shows the larger write sizes.

Following this simple test I changed 10000 to 500000 in the while loop that does the insert and captured the performance monitor counters for transactions/sec, log flushes and %processor time. Here are the results:

[Image: delayeddurability3 (performance monitor counters for transactions/sec, log flushes and % processor time during the 500000 row tests)]

The execution took over five minutes with the option disabled and forty seconds when it was set to forced. You can see in black the improvement in transactions/sec.

The actual transaction log content is identical for both executions (not surprising as the option simply controls when the log buffer is written to disk, not its content). Here are the results for two transaction entries using fn_dblog:

[Image: delayeddurability4 (fn_dblog output comparing transaction entries from the two executions)]


Delayed durability is available in all editions of SQL Server 2014. For workloads that can tolerate a certain amount of data loss you’ll probably notice an improvement in your transaction processing rate.
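One footnote: FORCED applies delayed durability to every transaction in the database. If only some of the workloads can tolerate the potential data loss, the database option can instead be set to ALLOWED and the behaviour requested per transaction at commit time, along these lines (a sketch using the test table above):

ALTER DATABASE [DelayedDurabilityTest] SET DELAYED_DURABILITY = ALLOWED WITH NO_WAIT

BEGIN TRANSACTION
INSERT INTO dbo.TestingData( col1, col2 ) VALUES (DEFAULT, DEFAULT)
COMMIT TRANSACTION WITH ( DELAYED_DURABILITY = ON )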


Data analysis operations using the Python pandas module

Within the SQL Server world we tend to use tools such as raw SQL or MDX code, Excel, DAX in Power Pivot and maybe even the M language in Power Query to do most of our data analysis work. Outside of these Microsoft-orientated tools we may even use the R language, which despite being around for a very long time is only now being recognised within the SQL Server business intelligence community. In this post I want to mention another data analysis tool which is feature rich, has a large user base (as a result of its parent language) and which can be used to bridge the gap between the Windows and SQL Server world and the other worlds which include technologies such as NoSQL. The data analysis tool that I’m referring to is the pandas module used within the Python language.

As is the case for R, the Python language has a very large user base. The Python pandas module provides the DataFrame object which is analogous to the same object in R. SQL-type operations such as join, merge, group by and pivot can all be performed on pandas DataFrames. I’ll illustrate the use of the pandas module using some very simple examples. I’m using the freely available Anaconda Python 2.7 (64 bit) distribution on Windows. You can obtain this from https://store.continuum.io/cshop/anaconda/. I’m using the interactive IPython qtconsole to run all the code.

This first section of code shows how to connect to a SQL Server data source:


In [1]: import pandas as pd

In [2]: import pandas.io.sql as psql

In [3]: import pypyodbc

In [4]: conn = pypyodbc.connect('DRIVER={SQL Server};SERVER=mysqlserver;DATABASE=mydatabase;Trusted_Connection=yes')

In [5]: waitstats = psql.read_frame('SELECT sampletime,wait_type,waiting_tasks_count,wait_time_ms FROM dbo.capwaitstats', conn)

In [6]: waitstats.head(10)
Out[6]:
                  sampletime      wait_type  waiting_tasks_count  wait_time_ms
0 2014-03-25 11:19:15.610000  MISCELLANEOUS                    0             0
1 2014-03-25 11:19:15.610000    LCK_M_SCH_S                    0             0
2 2014-03-25 11:19:15.610000    LCK_M_SCH_M                   15             6
3 2014-03-25 11:19:15.610000        LCK_M_S                    5          1029
4 2014-03-25 11:19:15.610000        LCK_M_U                    0             0
5 2014-03-25 11:19:15.610000        LCK_M_X                    0             0
6 2014-03-25 11:19:15.610000       LCK_M_IS                    0             0
7 2014-03-25 11:19:15.610000       LCK_M_IU                    0             0
8 2014-03-25 11:19:15.610000       LCK_M_IX                    0             0
9 2014-03-25 11:19:15.610000      LCK_M_SIU                    0             0

[10 rows x 4 columns]

The pypyodbc module is not included in the Anaconda Python distribution but is very easy to install once Python is present. You simply run this to install it:

pip install pypyodbc

The actual execution above fetches data from an historical waitstats table. This table stores the waitstats in their raw form. That is, just like the DMV they are sourced from, the counts and times are cumulative figures since the start time of the SQL Server. The head command is used to show the first 10 rows of the dataframe. The first column (showing 0 to 9) represents the default index for the dataframe.

You’ll notice that I didn’t specify an ‘ORDER BY wait_type,sampletime’ clause in the SQL that fetches the data from the database. It’s best practice to include such a clause but if you haven’t or are unable to do so then you can sort the dataframe as shown next:

In [7]: waitstats.sort(['wait_type','sampletime'], inplace=True)
In [8]: waitstats.set_index(['wait_type','sampletime'],inplace=True)

In [9]: waitstats.ix['OLEDB']
Out[9]:                     waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                22733          1534
2014-03-25 11:19:20.613000                22733          1534
2014-03-25 11:19:25.617000                24257          1677
2014-03-25 11:19:30.620000                24257          1677
2014-03-25 11:19:35.623000                24257          1677
2014-03-25 11:19:40.627000                24257          1677
2014-03-25 11:19:45.630000                24257          1677
2014-03-25 11:19:50.633000                24257          1677
2014-03-25 11:19:55.637000                24257          1677
2014-03-25 11:20:00.640000                25781          1816
2014-03-25 11:20:05.643000                25781          1816
2014-03-25 11:20:10.647000                26641          1876
2014-03-25 11:20:15.650000                26641          1876
2014-03-25 11:20:20.653000                26641          1876
2014-03-25 11:20:25.657000                26641          1876
2014-03-25 11:20:30.660000                26641          1876
2014-03-25 11:20:35.663000                26641          1876
2014-03-25 11:20:40.667000                27435          1935
2014-03-25 11:20:45.670000                27435          1935
2014-03-25 11:20:50.673000                27435          1935

[20 rows x 2 columns]

After the sort I create a multi-index on wait_type and sampletime. I show the use of an index fetch using ix to select out one of the wait types.

As mentioned above the waitstats are cumulative figures. The dataframe will now be grouped by the wait_type index level and the diff command (equivalent of lead/lag) will be used to obtain the waitstat deltas:


In [10]: waitstats = waitstats.groupby(level=['wait_type']).diff()

In [11]: waitstats.ix['OLEDB']
Out[11]:
                            waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                  NaN           NaN
2014-03-25 11:19:20.613000                    0             0
2014-03-25 11:19:25.617000                 1524           143
2014-03-25 11:19:30.620000                    0             0
2014-03-25 11:19:35.623000                    0             0
2014-03-25 11:19:40.627000                    0             0
2014-03-25 11:19:45.630000                    0             0
2014-03-25 11:19:50.633000                    0             0
2014-03-25 11:19:55.637000                    0             0
2014-03-25 11:20:00.640000                 1524           139
2014-03-25 11:20:05.643000                    0             0
2014-03-25 11:20:10.647000                  860            60
2014-03-25 11:20:15.650000                    0             0
2014-03-25 11:20:20.653000                    0             0
2014-03-25 11:20:25.657000                    0             0
2014-03-25 11:20:30.660000                    0             0
2014-03-25 11:20:35.663000                    0             0
2014-03-25 11:20:40.667000                  794            59
2014-03-25 11:20:45.670000                    0             0
2014-03-25 11:20:50.673000                    0             0

[20 rows x 2 columns]

In [12]: waitstats = waitstats.dropna()

You can see that with the assignment statement I’m overwriting the original waitstats dataframe, which held the cumulative figures, with the result of the groupby, which holds just the deltas. The dropna command is used to remove the NaN rows.

Now to identify the top five waits based on the sum of the waiting_tasks_counts I use this:


In [13]: topwaits = waitstats.groupby(level=['wait_type'])['waiting_tasks_count'].sum().order(ascending=False)[:5]

In [14]: topwaits
Out[14]:
wait_type
OLEDB               4702
LATCH_EX            1524
ASYNC_NETWORK_IO     914
SLEEP_TASK           505
PAGEIOLATCH_SH       450
Name: waiting_tasks_count, dtype: float64

The wait_type index from the topwaits object can be used to index into the main waitstats dataframe to fetch the full data for the top five waits:


In [15]: topwaits = waitstats.ix[topwaits.index]

In [16]: topwaits.head(10)
Out[16]:
                                             waiting_tasks_count  wait_time_ms
wait_type        sampletime
ASYNC_NETWORK_IO 2014-03-25 11:19:20.613000                    0             0
                 2014-03-25 11:19:25.617000                  182           999
                 2014-03-25 11:19:30.620000                    0             0
                 2014-03-25 11:19:35.623000                  182           871
                 2014-03-25 11:19:40.627000                  182           914
                 2014-03-25 11:19:45.630000                    2             0
                 2014-03-25 11:19:50.633000                    0             0
                 2014-03-25 11:19:55.637000                  182           877
                 2014-03-25 11:20:00.640000                    0             0
                 2014-03-25 11:20:05.643000                    0             0

[10 rows x 2 columns]

Now we’re almost at the end of this analysis. I want to pivot the wait_type and show the waiting_tasks_count as the value. The unstack command is used for this:

In [17]: pivotted = topwaits.unstack(['wait_type'])['waiting_tasks_count']

In [18]: pivotted
Out[18]:
wait_type                  ASYNC_NETWORK_IO LATCH_EX OLEDB PAGEIOLATCH_SH SLEEP_TASK
sampletime
2014-03-25 11:19:20.613000                0        0     0              0         27
2014-03-25 11:19:25.617000              182      649  1524            139         26
2014-03-25 11:19:30.620000                0        0     0              0         28
2014-03-25 11:19:35.623000              182        0     0              0         26
2014-03-25 11:19:40.627000              182        0     0              0         27
2014-03-25 11:19:45.630000                2        0     0              0         26
2014-03-25 11:19:50.633000                0        0     0              0         26
2014-03-25 11:19:55.637000              182        0     0              0         27
2014-03-25 11:20:00.640000                0      642  1524            139         26
2014-03-25 11:20:05.643000                0        0     0              0         27
2014-03-25 11:20:10.647000                0       82   860             99         26
2014-03-25 11:20:15.650000                0        0     0              0         27
2014-03-25 11:20:20.653000                0        0     0              0         26
2014-03-25 11:20:25.657000              182        0     0              0         27
2014-03-25 11:20:30.660000                0        0     0              0         27
2014-03-25 11:20:35.663000                0        0     0              0         26
2014-03-25 11:20:40.667000                0      151   794             73         27
2014-03-25 11:20:45.670000                0        0     0              0         26
2014-03-25 11:20:50.673000                2        0     0              0         27

[19 rows x 5 columns]

And last of all we want to plot the values. The matplotlib module is the one to use here:


In [19]: import matplotlib.pyplot as plt

In [20]: pivotted.plot()
Out[20]: <matplotlib.axes.AxesSubplot at 0xbf81198>

In [21]: plt.show()

Here’s the graph:

[Image: waitstats (matplotlib plot of the pivoted top five wait types over time)]

There are probably more efficient ways to convert the cumulative stats to deltas than the approach I’ve shown above. However, this isn’t really the point of the blog post. The intention is to show just how much manipulation can be done using the pandas module. I’ve only touched on some of the commands. There is comprehensive documentation at http://pandas.pydata.org/pandas-docs/stable/index.html

In terms of use cases, areas that come to mind are any form of time series analysis, extract-transform-load scenarios, aggregations (eg rolling averages, rolling sums), fast generation of randomised test data and so on. Because of the cross-platform nature of Python it’s ideally suited to the scenario I mentioned at the start of this post. NoSQL-type databases typically lack some of the features of relational databases. For example, Cassandra doesn’t allow joins between tables or complex aggregations. Architects typically take this into consideration by incorporating denormalisation into the data model and using Hadoop/Hive for complex reporting. With the Python CQL module providing the equivalent of ODBC connectivity to Cassandra and the pandas module providing the data analysis function, you can see that what are seen as limitations of the NoSQL database can be overcome by solutions available in the middleware.
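As a tiny illustration of the rolling aggregation and randomised test data use cases just mentioned, something along these lines works (a sketch; pd.rolling_mean was the API at the time of writing, newer pandas versions use the .rolling().mean() form instead):

import numpy as np
import pandas as pd

# randomised test data: 60 five-second samples for two columns
testdata = pd.DataFrame(np.random.randint(0, 100, size=(60, 2)),
                        columns=['waiting_tasks_count', 'wait_time_ms'],
                        index=pd.date_range('2014-03-25 11:19:15', periods=60, freq='5S'))

# one minute rolling average of the counts (12 x 5 second samples)
rolling = pd.rolling_mean(testdata['waiting_tasks_count'], window=12)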

If you are a SQL Server person who thinks that the pandas module may be of interest, here is the definitive reference from the author of the module:

http://www.amazon.co.uk/Python-Data-Analysis-Wrangling-IPython/dp/1449319793

The content assumes very little previous knowledge of Python. Prior to dipping into this reference I myself knew hardly any of the language constructs, yet I’m now able to understand many of the questions that arise on forums such as StackOverflow.


Setting up and using file tables

Certain scenarios may require your SQL Server to interact with files on its file system. For example, some external monitoring software may write to local files. You may want your SQL Server to interrogate the contents of these files. Alternatively you may want SQL Server to act on changes made to files, ie to act as a type of file watcher. In the past you may have used xp_cmdshell to implement such functionality. But now, with file table functionality in SQL Server 2012 (included as a standard edition feature), implementations become simpler and much more integrated with the server as a whole. The following describes my setup for file tables and some test results.

My database setup was as follows:


CREATE DATABASE [FileTableDB]
CONTAINMENT = NONE
ON  PRIMARY
( NAME = N'FileTableDB_Data', FILENAME = N'C:\Databases\FileTableDB\FileTableDB.mdf' , SIZE = 4288KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FilestreamFG] CONTAINS FILESTREAM  DEFAULT
( NAME = N'FilestreamData', FILENAME = N'C:\Databases\FileTableDB\FileTableDBFS' , MAXSIZE = UNLIMITED)
LOG ON
( NAME = N'FileTableDB_Log', FILENAME = N'C:\Databases\FileTableDB\FileTableDB.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO

ALTER DATABASE [FileTableDB] SET FILESTREAM( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileTable' )
GO
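This assumes that FILESTREAM has already been enabled at the instance level: the file I/O access side in SQL Server Configuration Manager and the access level via sp_configure, along these lines:

EXEC sp_configure 'filestream access level', 2   -- 2 = Transact-SQL and Windows streaming access
RECONFIGURE
GO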

My file table was created as follows:


USE [FileTableDB]
GO

CREATE TABLE [dbo].[ErrorLogs] AS FILETABLE ON [PRIMARY] FILESTREAM_ON [FilestreamFG]
WITH
(
FILETABLE_DIRECTORY = N'ErrorLogFolder', FILETABLE_COLLATE_FILENAME = Latin1_General_CI_AS
)
GO

So the file system folder that corresponded to the ErrorLogs file table was \\MyServerName\mssqlserver\FileTable\ErrorLogFolder.
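If you don’t want to hard code that UNC path you can derive it. A sketch using the built-in FileTable functions:

SELECT FileTableRootPath(N'dbo.ErrorLogs') AS filetable_root_path;

SELECT name, file_stream.GetFileNamespacePath(1) AS full_unc_path
FROM dbo.ErrorLogs;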

For my testing I wanted to implement a type of file system watcher, a solution which would capture the activity happening in \\MyServerName\mssqlserver\FileTable\ErrorLogFolder. I wanted the captured information to be saved in a conventional database table. This is the definition for that table:


USE [FileTableDB]
GO

CREATE TABLE [dbo].[ErrorsReceived](
[id] [int] IDENTITY(1,1) NOT NULL,
[triggeredby] [varchar](10) NOT NULL,
[filename] [nvarchar](255) NOT NULL,
[errortext] [nvarchar](max) NOT NULL,
PRIMARY KEY CLUSTERED
([id] ASC)
)
GO

This table would be populated via the firing of triggers against the file table. These are the trigger definitions:


CREATE TRIGGER [dbo].[triErrorLogs]
ON [dbo].[ErrorLogs]
FOR INSERT
AS
BEGIN
INSERT INTO dbo.ErrorsReceived ([triggeredby],[filename],[errortext])
SELECT 'INSERT',name,
CASE WHEN SUBSTRING(file_stream,1,2) = 0xFFFE
THEN CONVERT(NVARCHAR(max),file_stream)
ELSE CONVERT(VARCHAR(max),file_stream)
END AS errortext
FROM inserted
END
GO

CREATE TRIGGER [dbo].[truErrorLogs]
ON [dbo].[ErrorLogs]
FOR UPDATE
AS
BEGIN
INSERT INTO dbo.ErrorsReceived ([triggeredby],[filename],[errortext])
SELECT 'UPDATE',name,
CASE WHEN SUBSTRING(file_stream,1,2) = 0xFFFE
THEN CONVERT(NVARCHAR(max),file_stream)
ELSE CONVERT(VARCHAR(max),file_stream)
END AS errortext
FROM inserted
END
GO

CREATE TRIGGER [dbo].[trdErrorLogs]
ON [dbo].[ErrorLogs]
FOR DELETE
AS
BEGIN
INSERT INTO dbo.ErrorsReceived ([triggeredby],[filename],[errortext])
SELECT 'DELETE',name,
CASE WHEN SUBSTRING(file_stream,1,2) = 0xFFFE
THEN CONVERT(NVARCHAR(max),file_stream)
ELSE CONVERT(VARCHAR(max),file_stream)
END AS errortext
FROM deleted
END
GO

With the setup completed I create a file named server1.txt containing the line ‘2014-04-02 16:10:server1:error text from this server’ and copy this file into the file table folder. This should result in an insert action against the file table. Here are the actual results:

[Image: filetable1 (the ErrorLogs file table row and the ErrorsReceived rows captured after copying server1.txt)]

The first result set is from the file table. You can see the file name, its type and size and the file_stream column that contains the actual data. The data for the second result set comes from the firing of the triggers. You can see that the file copy operation is captured as an insert followed by an update.
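For reference, the first result set is just a select against the file table, which has a fixed schema. A query along these lines returns the columns mentioned (the schema also includes others such as creation_time and last_write_time):

SELECT name, file_type, cached_file_size, file_stream
FROM dbo.ErrorLogs;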

I’ll now edit server1.txt and save it as a unicode encoded text file. Here is the result:

[Image: filetable2 (result sets after saving server1.txt as a unicode file)]


Note the doubling of the cached_file_size, the 0xFFFE header in the file_stream column indicating unicode, and the interleaved 00 bytes of the double byte character representation. The trigger code uses the 0xFFFE byte order mark to determine whether the file_stream data is unicode or not and does the appropriate conversion before saving to the nvarchar column in the ErrorsReceived table.

Finally I’ll delete the file:

[Image: filetable3 (the empty ErrorLogs file table and the ErrorsReceived row written by the delete trigger)]

You can see that the file table is now empty and that the delete trigger fired to save its results.

In summary, these simple tests show that you can implement a file watcher using the file table functionality. Triggers can be created against the file table, and the file content residing in the file_stream column can be parsed for further processing.
