Delayed durability data loss demonstration

This MSDN article, http://msdn.microsoft.com/en-gb/library/dn449490.aspx, makes it clear under the section ‘When can I lose data? SQL Server shutdown and restart’ that a normal shutdown and restart can result in missing data. That is, client applications may receive acknowledgement that a transaction has been committed but, because of the deferred log writer activity, the log records may not have been persisted to the transaction log file. The following scripts demonstrate the data loss that you could encounter.

First I’ll show that a database with delayed durability disabled does not suffer data loss. My testing database is named DelayedDurabilityTest. I will be inserting data into a table named MyData:

CREATE TABLE DelayedDurabilityTest.dbo.MyData
(
id BIGINT NOT NULL IDENTITY,
val VARCHAR(255) NOT NULL
)

For the first test I set delayed durability to disabled:

ALTER DATABASE [DelayedDurabilityTest] SET DELAYED_DURABILITY = DISABLED WITH NO_WAIT
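
If you want to confirm the current setting you can query sys.databases (the delayed_durability_desc column is available from SQL Server 2014 onwards):

-- Expect DISABLED here; later in the post it will show FORCED
SELECT name, delayed_durability_desc
FROM sys.databases
WHERE name = 'DelayedDurabilityTest'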

Now in one query window run this to start populating the table:

-- Insert script - best to run this using SQLCMD - see the very end of this post
SET NOCOUNT ON
WHILE 1=1 
BEGIN
INSERT INTO DelayedDurabilityTest.dbo.MyData (val) VALUES ('Hello !!')
PRINT SCOPE_IDENTITY()
END

While this is running start a new query window and run this:

SHUTDOWN WITH NOWAIT

Switch back to the window that was running the inserts and note down the last printed identity value. For my example, this was

...16248
16249
16250
16251
16252
Msg 0, Level 11, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

So in my case the last printed identity value was 16252.

Now restart the SQL Server and run

SELECT COUNT(*) FROM DelayedDurabilityTest.dbo.MyData

In my case this returned a value of 16253. The fact that it is one higher than the last printed value in the insert window simply means that the printing of output was lagging behind the actual inserting of data. The important point is that there is no data loss.
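
As a further sanity check you can compare the row count with the highest identity value (a quick sketch; since the insert loop never rolls back, the two figures should agree):

SELECT COUNT(*) AS totalrows, MAX(id) AS maxid
FROM DelayedDurabilityTest.dbo.MyData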

Now we truncate the table, set delayed durability to FORCED and rerun the inserting window and then the shutdown window:


TRUNCATE TABLE DelayedDurabilityTest.dbo.MyData
ALTER DATABASE [DelayedDurabilityTest] SET DELAYED_DURABILITY = FORCED WITH NO_WAIT

This time the output of the inserting window was this:


18554
18555
18556
18557
18558
18559
Msg 109, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server. (provider: Shared Memory Provider, error: 0 - The pipe has been ended.)

So the last printed identity value was 18559. As before, there may well have been more rows inserted than this, the slow printed output under-representing the actual inserts. However, the row count in the table when SQL Server was restarted was 18512. In this case it is actually lower than the last printed identity value, so with delayed durability enabled a SQL Server shutdown has resulted in data loss.

When running the scripts it may take several attempts to produce a case where the row count is lower than the last printed identity value. Using SQLCMD to run the insert script you’ll see more occurrences where the last printed identity is higher than the actual row count in the table when SQL Server is restarted:


SQLCMD -S localhost -i insertscript.sql

Time spent in app, time spent in SQL Server

Application teams will sometimes believe they have a database issue when a recurring feed starts to perform poorly. Typically they will call on the DBA to run a trace to identify the poorest performing SQL calls. When the trace is run and the analysis completed you may find that there are no obvious SQL issues: all the statements perform well. The performance problem may not reside on the database tier at all but on the application tier. The application may be doing a large amount of number crunching or string processing between the recurring SQL calls, and these routines may have become the slow step in the feed.

Having performed the SQL trace it would be great if, in addition to the duration of the SQL calls, we could determine the elapsed time between SQL calls. One method for obtaining this figure is to subtract the end time of the previous SQL call from the start time of the current one. If you export the trace data to a table, filter on, for example, the application name, SPID, hostname or proc name that is doing the load and order by start time, you can use the LAG function to obtain the elapsed time between SQL calls.
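
If the trace was captured to a file, one way to get it into a table such as dbo.tracetbl is sys.fn_trace_gettable (a sketch only; the file path here is just a placeholder):

-- Import the trace file (and any rollover files) into a table for analysis
SELECT *
INTO dbo.tracetbl
FROM sys.fn_trace_gettable(N'C:\traces\feedtrace.trc', DEFAULT)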

In the following code the procedure that is performing the load is named myproc. The last two columns in the final result set show the total time spent doing SQL work and the total time spent in the application between the SQL calls.


WITH ctetracetimes AS
(
SELECT
starttime,
endtime,
LAG(endtime) OVER (ORDER BY starttime) AS previousendtime
FROM dbo.tracetbl
WHERE objectname = 'myproc'
),
cteappandsqltimes AS
(
SELECT
starttime,
endtime,
previousendtime,
DATEDIFF(millisecond,starttime,endtime) AS timespentsql,
DATEDIFF(millisecond,previousendtime,starttime) AS timespentapp
FROM ctetracetimes
)
SELECT *,
SUM(timespentsql) OVER() AS sumtimespentsql,
SUM(timespentapp) OVER() AS sumtimespentapp
FROM cteappandsqltimes
ORDER BY starttime

The results for repeated execution of the myproc stored procedure were as follows:

timespentsqlandapp

You can see that 50,000 milliseconds was spent doing the SQL related work (processing the stored procedure ten times) but over 150,000 milliseconds was spent in the application tier in between the SQL calls.

This type of analysis may be quite enlightening even for what you consider to be your well-performing feeds. You may be assuming that the vast majority of your feed time is associated with database processing when, in fact, it is associated with application processing. For example, one of my own feeds had a total elapsed time of twenty minutes. When the above analysis was performed the SQL time was only two minutes; eighteen minutes was being spent in the application tier. A poorly performing number crunching routine was identified and then optimised to reduce the overall run-time of the feed.


LAG and LEAD for SQL Server 2008

The LAG and LEAD analytical functions in SQL Server 2012 are very useful. Unfortunately, these functions don’t exist prior to the 2012 version. Most pre-2012 solutions resort to using self joins to identify the lagging or leading row. Here’s a solution that references the source table once:


WITH balance_details
AS
(
SELECT * FROM (
VALUES
('Tom','20140101',100),
('Tom','20140102',120),
('Tom','20140103',150),
('Tom','20140104',140),
('Tom','20140105',160),
('Tom','20140106',180),
('Jerry','20140101',210),
('Jerry','20140102',240),
('Jerry','20140103',230),
('Jerry','20140104',270),
('Jerry','20140105',190),
('Jerry','20140106',200),
('David','20140101',170),
('David','20140102',230),
('David','20140103',240),
('David','20140104',210),
('David','20140105',160),
('David','20140106',200)
) AS t (customer, balancedate,balance)
),
balance_cte AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate) rn
,(ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate))/2 rndiv2
,(ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate) + 1)/2 rnplus1div2
/*
,COUNT(*) OVER (PARTITION BY customer) partitioncount
*/
,customer
,balancedate
,balance
FROM balance_details
)
SELECT
rn
,rndiv2
,rnplus1div2
,customer,balancedate,balance
,CASE WHEN rn%2=1 THEN MAX(CASE WHEN rn%2=0 THEN balance END) OVER (PARTITION BY customer,rndiv2) ELSE MAX(CASE WHEN rn%2=1 THEN balance END) OVER (PARTITION BY customer,rnplus1div2)
END AS balance_lag
,CASE WHEN rn%2=1 THEN MAX(CASE WHEN rn%2=0 THEN balance END) OVER (PARTITION BY customer,rnplus1div2) ELSE MAX(CASE WHEN rn%2=1 THEN balance END) OVER (PARTITION BY customer,rndiv2)
END AS balance_lead
/*
,MAX(CASE WHEN rn=1 THEN balance END) OVER (PARTITION BY customer) AS first_value
,MAX(CASE WHEN rn=partitioncount THEN balance END) OVER (PARTITION BY customer) AS last_value
,MAX(CASE WHEN rn=4 THEN balance END) OVER (PARTITION BY customer) AS fourth_value
*/
FROM balance_cte
ORDER BY customer,balancedate

The results are as follows:

laglead

(The commented-out code shows how to obtain the equivalent of the SQL 2012 FIRST_VALUE and LAST_VALUE functions. I also show how to obtain the Nth value, where I’ve specified N as 4.)

Regarding the LAG code, it relies on the fact that integer division of the ROW_NUMBER by 2 can be used in the partitioning expression. For example, both 2 and 3 divided by 2 give 1. When the current row is row 3 (ie rn mod 2 is 1) we examine the window frame rows where rndiv2 is 1 and choose the balance value where the frame row is even (ie rn mod 2 is 0). That is, the value for row 2. When the current row is row 2 (ie rn mod 2 is 0) we examine the window frame rows where rnplus1div2 is 1 and choose the balance value where the frame row is odd (ie rn mod 2 is 1). That is, the value for row 1.

For LEAD the principle is the same, we just use the opposite ‘division by 2’ expression in the partition clause.
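
For comparison, on SQL Server 2012 and above the same two extra columns come straight from the built-in functions (shown here against the same balance_details data):

SELECT
customer, balancedate, balance
,LAG(balance) OVER (PARTITION BY customer ORDER BY balancedate) AS balance_lag
,LEAD(balance) OVER (PARTITION BY customer ORDER BY balancedate) AS balance_lead
FROM balance_details
ORDER BY customer, balancedate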


Finding the position of a subsequence in a larger sequence

This post is inspired by this article: http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2

In particular, the very elegant solution posted by Pesomannen in the discussion section caught my eye. I’m going to translate that SQL solution into an equivalent Python pandas solution.
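
For reference, the heart of that SQL approach is a join followed by a group by on the difference between the two key columns. A minimal single-column sketch, assuming a sequence table T1(keycol, val) and a pattern table P(keycol, val) both keyed by contiguous integers, would look something like this:

-- A full match exists wherever a key-difference group has as many rows as the pattern
SELECT MIN(T1.keycol) AS startpos, MAX(T1.keycol) AS endpos
FROM T1
JOIN P ON P.val = T1.val
GROUP BY T1.keycol - P.keycol
HAVING COUNT(*) = (SELECT COUNT(*) FROM P)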

Instead of using integers as the sequence values I’m going to use lower case letters. Also, rather than having one column in which I search for the required subsequence, I’m going to search two columns for a specified pair of subsequences. The two searched columns are made up of 200,000 random values consisting of the letters a to f. The subsequence I’m going to search for is four rows long across the two columns: specifically, val1 = ('e','a','d','b') and val2 = ('a','d','c','e'). Here’s the solution:


In [1]:import pandas as pd

In [2]:import numpy as np

In [3]:import string

In [4]:letters = np.asarray(list(string.letters), dtype=object)

In [5]:T1 = pd.DataFrame(dict(val1=letters[:6][np.random.randint(0, 6, 200000)],val2=letters[:6][np.random.randint(0, 6, 200000)]))

In [6]:T1 = T1.reset_index()

In [7]:T1[:5]

Out[7]:
      index val1  val2
0     0     e     c
1     1     f     a
2     2     d     f
3     3     c     e
4     4     e     e

I’m showing the first five rows out of a total of 200,000 rows. I’m going to search for this subsequence:


In [8]:P = pd.DataFrame(data = {'val1': ['e','a','d','b'],'val2': ['a','d','c','e']})

In [9]:P = P.reset_index()

In [10]:P

Out[10]:
      index val1  val2
0     0     e     a
1     1     a     d
2     2     d     c
3     3     b     e

The first step is to join the subsequence to the 200,000 rows matching on val1 and val2:


In [11]:J = T1.merge(P,on=['val1','val2'],how='inner')

In [12]:J

Out[12]:
     index_x val1  val2  index_y
0     22     b     e     3
1     41     b     e     3
2     69     b     e     3
3     81     b     e     3
4     221     b     e     3
5     237     b     e     3
6     275     b     e     3
7     348     b     e     3
8     453     b     e     3
9     507     b     e     3

I’m showing the first 10 matching rows.

Now the elegant SQL solution is based on the fact that if you group by the difference between the matching keys, the subsequence matches occur where the group counts equal the length of the subsequence. This translates to the following:

In [13]:FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg({min,max,'count'})

In [14]:FullResult[FullResult['count'] == 4]

Out[14]:
           count min        max
180767     4     180767     180770
So this result tells us that the subsequence is found at positions 180767 to 180770. Let’s check by interrogating the original 200,000 rows around these key positions:

In [15]:T1[180760:180775]

Out[15]:
           index      val1  val2
180760     180760     a     c
180761     180761     c     f
180762     180762     e     e
180763     180763     d     e
180764     180764     d     b
180765     180765     d     f
180766     180766     e     d
180767     180767     e     a    <----
180768     180768     a     d    <----
180769     180769     d     c    <----
180770     180770     b     e    <----
180771     180771     a     f
180772     180772     e     f
180773     180773     f     c
180774     180774     f     d

 

You see the matching subsequence in positions 180767 to 180770.

The point of this post is to show that there are tools other than SQL that can be used to solve such puzzles. The Python pandas module provides an SQL-like set of operations for joining data frames. These operations are not that difficult to grasp if you come from an SQL background.


Excessive logging pre-SQL 2012 when changing variable length columns from NULL to NOT NULL

In pre-SQL Server 2012 versions you may encounter excessive transaction logging when you use the ALTER TABLE ALTER COLUMN command to change NULL to NOT NULL. Depending on the record size the amount of transaction log space used could be over three times the size of the table you are altering. This post shows that this is specifically an issue when changing variable length columns from NULL to NOT NULL. The issue does not occur for the same change applied to fixed length columns.

The basis of this post is this StackExchange thread: http://dba.stackexchange.com/questions/29522/why-does-alter-column-to-not-null-cause-massive-log-file-growth

The thread pretty much explains the problem, but the point about fixed versus variable length columns is not entirely clear. Here’s a test script which demonstrates the differences:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col varchar(3) NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col varchar(3) NOT NULL;

SELECT * FROM fn_dblog(null,null);

You can see the ten LOP_MODIFY_ROW entries where the AllocUnitName is dbo.testalternotnull:

altertonotnull1

Also note the log reserve figures.
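
To put numbers on this you can total the log record lengths and log reserve for the entries that reference the table (a quick sketch against the fn_dblog output above):

SELECT COUNT(*) AS logrecords,
SUM([Log Record Length]) AS totalrecordbytes,
SUM([Log Reserve]) AS totallogreserve
FROM fn_dblog(null,null)
WHERE AllocUnitName = 'dbo.testalternotnull'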

Now let’s check the case where we are changing from NOT NULL to NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col varchar(3) NOT NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col varchar(3) NULL;

SELECT * FROM fn_dblog(null,null);

altertonotnull2

No reference to the table name here so this is just a metadata change.

Now let’s check the results for a fixed length column changing NULL to NOT NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col char(3) NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,'abc')
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col char(3) NOT NULL;

SELECT * FROM fn_dblog(null,null);

altertonotnull3

There are no references to LOP_MODIFY_ROW where AllocUnitName is dbo.testalternotnull, so for fixed length columns this is a metadata only change.

I won’t show the output for changing a CHAR(3) NOT NULL to NULL as the result is the same as that for VARCHAR(3) NOT NULL to NULL. That is, it is a metadata only change.

What about BIT columns? Well, given that they are represented in the record layout as a byte packed structure, they can be considered to be a variable type. Let’s check this by showing the output for changing a BIT NULL to a BIT NOT NULL:


IF OBJECT_ID('dbo.testalternotnull') IS NOT NULL DROP TABLE dbo.testalternotnull;

CREATE TABLE dbo.testalternotnull
(
id int NOT NULL,
col BIT NULL
);
GO

INSERT INTO dbo.testalternotnull VALUES (1,0)
GO 10

CHECKPOINT;

ALTER TABLE dbo.testalternotnull ALTER COLUMN col BIT NOT NULL;

SELECT * FROM fn_dblog(null,null);

altertonotnull4

So, yes, you can consider the BIT type to be a variable type and changing it from NULL to NOT NULL can result in high transaction log use.

In summary, this post shows examples of high, and what you may consider excessive, transaction log use when changing variable length columns from NULL to NOT NULL. The StackExchange thread referred to at the start of this post shows that the resulting size of the transaction log may be many times the size of the table being altered. Fortunately, for SQL Server 2012 and above most column changes appear to be metadata only changes.


Delayed durability: performance boost for certain workloads

If some of your workloads are able to tolerate a certain amount of data loss and they are already bottlenecked on the write performance of the transaction log, then SQL Server 2014’s delayed durability option will probably give you an immediate performance boost. One use case that comes to mind is staging databases where ETL work is being performed. Some of the data feeds may be implemented as single row inserts as opposed to batch inserts. Within the application environment you may have no options available to convert to batch inserts (ie multiple inserts in one transaction). This post shows the performance improvements you could see by enabling delayed durability.

The single row inserts go into this table:


CREATE TABLE dbo.TestingData
(
id INT NOT NULL IDENTITY,
col1 INT NOT NULL DEFAULT 2000,
col2 VARCHAR(50) NOT NULL DEFAULT 'wxyz'
)

Test script inserting 10000 rows:


SET NOCOUNT ON
DECLARE @i INT = 1
WHILE @i <= 10000
BEGIN
INSERT INTO dbo.TestingData( col1, col2 ) VALUES (DEFAULT, DEFAULT)
SET @i = @i + 1
END
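
For context, the ‘batch inserts’ alternative mentioned earlier simply means committing groups of rows in a single transaction rather than one transaction per row. A sketch of the same loop committing every 1000 rows would be:

SET NOCOUNT ON
DECLARE @i INT = 1
BEGIN TRAN
WHILE @i <= 10000
BEGIN
INSERT INTO dbo.TestingData( col1, col2 ) VALUES (DEFAULT, DEFAULT)
IF @i % 1000 = 0
BEGIN
COMMIT TRAN
BEGIN TRAN
END
SET @i = @i + 1
END
COMMIT TRAN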

Toggle the delayed durability option using this:


ALTER DATABASE [DelayedDurabilityTest] SET DELAYED_DURABILITY = <FORCED or DISABLED> WITH NO_WAIT
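
The before and after figures below were taken by sampling sys.dm_io_virtual_file_stats for the test database, along these lines (a sketch; the output of each sample needs to be kept so the deltas can be calculated):

SELECT GETDATE() AS sampletime,
file_id,
num_of_reads,
num_of_bytes_read,
num_of_writes,
num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(DB_ID('DelayedDurabilityTest'), NULL)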

Results inserting 10000 rows (captured using before and after sampling of sys.dm_io_virtual_file_stats):

             sampletime    file_id    reads    readbytes    bytesperread    writes    writebytes    bytesperwrite
2014-04-21 10:22:02.967    1          0        0            0               0         0             0
2014-04-21 10:22:02.967    2          0        0            0               10000     5154816       515
2014-04-21 11:00:01.947    1          0        0            0               0         0             0
2014-04-21 11:00:01.947    2          0        0            0               1733      4046336       2334

The test at 10:22 was conducted with delayed durability disabled (the default). File id 1 is the data file, file id 2 is the log file. The test at 11:00 was conducted with delayed durability set to forced. You can see the reduced number of writes to the transaction log and the higher figure for bytes per write. The Sysinternals Process Monitor utility confirms these details. For the 10:22 test we see this:

delayeddurability1

The Sequence column starts at 0, so this is showing us 10000 writes of 512 bytes each (matching the results from sys.dm_io_virtual_file_stats). For the 11:00 test we see this:

delayeddurability2

The Sequence column starts at 10000 and ends at 11732, so that’s 1733 write IOs, matching the figure from the DMV. The Detail column shows the larger write sizes.

Following this simple test I changed 10000 to 500000 in the while loop that does the insert and captured the performance monitor counters for transactions/sec, log flushes and %processor time. Here are the results:

delayeddurability3

The execution took over five minutes with the option disabled and forty seconds when it was set to forced. You can see, in black, the improvement in transactions/sec.

The actual transaction log content is identical for both executions (not surprising as the option simply controls when the log buffer is written to disk, not its content). Here are the results for two transaction entries using fn_dblog:
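
The entries shown below can be pulled out with a query along these lines (a sketch using a few of the fn_dblog columns):

SELECT [Current LSN], [Operation], [Transaction ID], [Log Record Length]
FROM fn_dblog(null,null)
WHERE AllocUnitName = 'dbo.TestingData'
ORDER BY [Current LSN]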

delayeddurability4

Delayed durability is available in all editions of SQL Server 2014. For workloads that can tolerate a certain amount of data loss you’ll probably notice an improvement in your transaction processing rate.


Data analysis operations using the Python pandas module

Within the SQL Server world we tend to use tools such as raw SQL or MDX code, Excel, DAX in Power Pivot and maybe even the M language in Power Query to do most of our data analysis work. Outside of these Microsoft-orientated tools we may even use the R language which, despite being around for a very long time, is only now being recognised within the SQL Server business intelligence community. In this post I want to mention another data analysis tool which is feature rich, has a large user base (as a result of its parent language) and which can be used to bridge the gap between the Windows and SQL Server world and other worlds that include technologies such as NoSQL. The data analysis tool that I’m referring to is the pandas module used within the Python language.

As is the case for R, the Python language has a very large user base. The Python pandas module provides the DataFrame object which is analogous to the same object in R. SQL-type operations such as join, merge, group by and pivot can all be performed on pandas DataFrames. I’ll illustrate the use of the pandas module using some very simple examples. I’m using the freely available Anaconda Python 2.7 (64 bit) distribution on Windows. You can obtain this from https://store.continuum.io/cshop/anaconda/. I’m using the interactive IPython qtconsole to run all the code.

This first section of code shows how to connect to a SQL Server data source:


In [1]: import pandas as pd

In [2]: import pandas.io.sql as psql

In [3]: import pypyodbc

In [4]: conn = pypyodbc.connect('DRIVER={SQL Server};SERVER=mysqlserver;DATABASE=mydatabase;Trusted_Connection=yes')

In [5]: waitstats = psql.read_frame('SELECT sampletime,wait_type,waiting_tasks_count,wait_time_ms FROM dbo.capwaitstats', conn)

In [6]: waitstats.head(10)
Out[6]:
                  sampletime      wait_type  waiting_tasks_count  wait_time_ms
0 2014-03-25 11:19:15.610000  MISCELLANEOUS                    0             0
1 2014-03-25 11:19:15.610000    LCK_M_SCH_S                    0             0
2 2014-03-25 11:19:15.610000    LCK_M_SCH_M                   15             6
3 2014-03-25 11:19:15.610000        LCK_M_S                    5          1029
4 2014-03-25 11:19:15.610000        LCK_M_U                    0             0
5 2014-03-25 11:19:15.610000        LCK_M_X                    0             0
6 2014-03-25 11:19:15.610000       LCK_M_IS                    0             0
7 2014-03-25 11:19:15.610000       LCK_M_IU                    0             0
8 2014-03-25 11:19:15.610000       LCK_M_IX                    0             0
9 2014-03-25 11:19:15.610000      LCK_M_SIU                    0             0

[10 rows x 4 columns]

The pypyodbc module is not included in the Anaconda Python distribution but is very easy to install once Python is present. You simply run this to install it:

pip install pypyodbc

The execution above fetches data from an historical waitstats table. This table stores the waitstats in their raw form. That is, just like the DMV they are sourced from, the counts and times are cumulative figures since the SQL Server last started. The head command is used to show the first 10 rows of the dataframe. The first column (showing 0 to 9) represents the default index for the dataframe.
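
The collection side is outside the scope of this post, but a table like dbo.capwaitstats can be populated by periodically sampling sys.dm_os_wait_stats along these lines (a sketch; the capture table and the sampling schedule are assumptions):

INSERT INTO dbo.capwaitstats (sampletime, wait_type, waiting_tasks_count, wait_time_ms)
SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats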

You’ll notice that I didn’t specify an ‘ORDER BY wait_type,sampletime’ clause in the SQL that fetches the data from the database. It’s best practice to include such a clause but if you haven’t or are unable to do so then you can sort the dataframe as shown next:

In [7]: waitstats.sort(['wait_type','sampletime'], inplace=True)
In [8]: waitstats.set_index(['wait_type','sampletime'],inplace=True)

In [9]: waitstats.ix['OLEDB']
Out[9]:                     waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                22733          1534
2014-03-25 11:19:20.613000                22733          1534
2014-03-25 11:19:25.617000                24257          1677
2014-03-25 11:19:30.620000                24257          1677
2014-03-25 11:19:35.623000                24257          1677
2014-03-25 11:19:40.627000                24257          1677
2014-03-25 11:19:45.630000                24257          1677
2014-03-25 11:19:50.633000                24257          1677
2014-03-25 11:19:55.637000                24257          1677
2014-03-25 11:20:00.640000                25781          1816
2014-03-25 11:20:05.643000                25781          1816
2014-03-25 11:20:10.647000                26641          1876
2014-03-25 11:20:15.650000                26641          1876
2014-03-25 11:20:20.653000                26641          1876
2014-03-25 11:20:25.657000                26641          1876
2014-03-25 11:20:30.660000                26641          1876
2014-03-25 11:20:35.663000                26641          1876
2014-03-25 11:20:40.667000                27435          1935
2014-03-25 11:20:45.670000                27435          1935
2014-03-25 11:20:50.673000                27435          1935

[20 rows x 2 columns]

After the sort I create a multi-index on wait_type and sampletime. I show the use of an index fetch using ix to select out one of the wait types.

As mentioned above the waitstats are cumulative figures. The dataframe will now be grouped by the wait_type index level and the diff command (equivalent of lead/lag) will be used to obtain the waitstat deltas:


In [10]: waitstats = waitstats.groupby(level=['wait_type']).diff()

In [11]: waitstats.ix['OLEDB']
Out[11]:
                            waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                  NaN           NaN
2014-03-25 11:19:20.613000                    0             0
2014-03-25 11:19:25.617000                 1524           143
2014-03-25 11:19:30.620000                    0             0
2014-03-25 11:19:35.623000                    0             0
2014-03-25 11:19:40.627000                    0             0
2014-03-25 11:19:45.630000                    0             0
2014-03-25 11:19:50.633000                    0             0
2014-03-25 11:19:55.637000                    0             0
2014-03-25 11:20:00.640000                 1524           139
2014-03-25 11:20:05.643000                    0             0
2014-03-25 11:20:10.647000                  860            60
2014-03-25 11:20:15.650000                    0             0
2014-03-25 11:20:20.653000                    0             0
2014-03-25 11:20:25.657000                    0             0
2014-03-25 11:20:30.660000                    0             0
2014-03-25 11:20:35.663000                    0             0
2014-03-25 11:20:40.667000                  794            59
2014-03-25 11:20:45.670000                    0             0
2014-03-25 11:20:50.673000                    0             0

[20 rows x 2 columns]

In [12]: waitstats = waitstats.dropna()

You can see that with the assignment statement I’m overwriting the original waitstats dataframe that held the cumulative figures with the results of the groupby holding just the deltas. The dropna command is used to remove the NaN rows.

Now to identify the top five waits based on the sum of waiting_tasks_count I use this:


In [13]: topwaits = waitstats.groupby(level=['wait_type'])['waiting_tasks_count'].sum().order(ascending=False)[:5]

In [14]: topwaits
Out[14]:
wait_type
OLEDB               4702
LATCH_EX            1524
ASYNC_NETWORK_IO     914
SLEEP_TASK           505
PAGEIOLATCH_SH       450
Name: waiting_tasks_count, dtype: float64

The wait_type index from the topwaits object can be used to index into the main waitstats dataframe to fetch the full data for the top five waits:


In [15]: topwaits = waitstats.ix[topwaits.index]

In [16]: topwaits.head(10)
Out[16]:
                                             waiting_tasks_count  wait_time_ms
wait_type        sampletime
ASYNC_NETWORK_IO 2014-03-25 11:19:20.613000                    0             0
                 2014-03-25 11:19:25.617000                  182           999
                 2014-03-25 11:19:30.620000                    0             0
                 2014-03-25 11:19:35.623000                  182           871
                 2014-03-25 11:19:40.627000                  182           914
                 2014-03-25 11:19:45.630000                    2             0
                 2014-03-25 11:19:50.633000                    0             0
                 2014-03-25 11:19:55.637000                  182           877
                 2014-03-25 11:20:00.640000                    0             0
                 2014-03-25 11:20:05.643000                    0             0

[10 rows x 2 columns]

Now we’re almost at the end of this analysis. I want to pivot the wait_type and show the waiting_tasks_count as the value. The unstack command is used for this:

In [17]: pivotted = topwaits.unstack(['wait_type'])['waiting_tasks_count']

In [18]: pivotted
Out[18]:
wait_type                  ASYNC_NETWORK_IO LATCH_EX OLEDB PAGEIOLATCH_SH SLEEP_TASK
sampletime
2014-03-25 11:19:20.613000                0        0     0              0         27
2014-03-25 11:19:25.617000              182      649  1524            139         26
2014-03-25 11:19:30.620000                0        0     0              0         28
2014-03-25 11:19:35.623000              182        0     0              0         26
2014-03-25 11:19:40.627000              182        0     0              0         27
2014-03-25 11:19:45.630000                2        0     0              0         26
2014-03-25 11:19:50.633000                0        0     0              0         26
2014-03-25 11:19:55.637000              182        0     0              0         27
2014-03-25 11:20:00.640000                0      642  1524            139         26
2014-03-25 11:20:05.643000                0        0     0              0         27
2014-03-25 11:20:10.647000                0       82   860             99         26
2014-03-25 11:20:15.650000                0        0     0              0         27
2014-03-25 11:20:20.653000                0        0     0              0         26
2014-03-25 11:20:25.657000              182        0     0              0         27
2014-03-25 11:20:30.660000                0        0     0              0         27
2014-03-25 11:20:35.663000                0        0     0              0         26
2014-03-25 11:20:40.667000                0      151   794             73         27
2014-03-25 11:20:45.670000                0        0     0              0         26
2014-03-25 11:20:50.673000                2        0     0              0         27

[19 rows x 5 columns]

And last of all we want to plot the values. The matplotlib module is the one to use here:


In [19]: import matplotlib.pyplot as plt

In [20]: pivotted.plot()
Out[20]: <matplotlib.axes.AxesSubplot at 0xbf81198>

In [21]: plt.show()

Here’s the graph:

waitstats

There are probably more efficient ways to convert the cumulative stats to deltas than the approach I’ve shown above. However, this isn’t really the point of the blog post. The intention is to show just how much manipulation can be done using the pandas module. I’ve only touched on some of the commands. There is comprehensive documentation at http://pandas.pydata.org/pandas-docs/stable/index.html

In terms of use cases, areas that come to mind are any form of time series analysis, extract, transform and load scenarios, aggregations (eg rolling averages, rolling sums), and fast generation of randomised test data. Because of the cross platform nature of Python it’s ideally suited to a scenario I mentioned at the start of this post. NoSQL databases typically lack some of the features of relational databases. For example, Cassandra doesn’t allow joins between tables or complex aggregations. Architects typically take this into consideration by incorporating denormalisation into the data model and using Hadoop/Hive for complex reporting. With the Python CQL module providing the equivalent of ODBC connectivity to Cassandra and the pandas module providing the data analysis function, you can see that what are seen as limitations in the NoSQL database can be overcome by solutions available in the middleware.

If you are a SQL Server person who thinks that the pandas module may be of interest here is the definitive reference from the author of the module:

http://www.amazon.co.uk/Python-Data-Analysis-Wrangling-IPython/dp/1449319793

The content assumes very little previous knowledge of Python. Prior to dipping into this reference I myself knew hardly any of the language constructs yet I’m now able to understand many of the questions that arise on forums such as StackOverflow.
