Archive for category R

Charting virtual file stats using R

This blog post shows how to retrieve data using R and chart the results in a number of ways. The example data consists of virtual file stats. The source is the virtual file stats DMV, so the ‘measures’ are ever-increasing values accumulated since the last restart of the SQL Server. The deltas could be calculated using SQL windowing functions, but I show how to use the dplyr lag function to achieve the same result. The raw cumulative data is as follows:


The TestDB consists of four data files and the virtual file stats DMV was sampled every 10 minutes. The above shows 30 rows out of a total of 2304 rows (from 1st March to 4th March inclusive). I’m only interested in the ‘number of bytes read’ measure. The same analysis could be done for any of the other measure columns.
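Before the full script, here is a minimal sketch of how dplyr’s lag function turns cumulative counters into per-interval deltas. The numbers are made up for illustration and are not the real DMV samples:

```r
library(dplyr)

# Toy cumulative byte counts for two files, three samples each
samples <- data.frame(
  physical_name = c("file1", "file1", "file1", "file2", "file2", "file2"),
  num_of_bytes_read = c(100, 250, 600, 50, 50, 90)
)

# Group per file so lag never crosses a file boundary, then subtract
# the previous cumulative value to get the delta for each interval
deltas <- samples %>%
  group_by(physical_name) %>%
  mutate(delta = num_of_bytes_read - lag(num_of_bytes_read))

# The first sample of each file has no predecessor, so its delta is NA
print(deltas)
```

The same grouping trick is what makes the full script below calculate deltas correctly per data file.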

Here’s the R code to read from the table and chart the results:

# Use the install.packages lines if this is the first use of the libraries
# install.packages("RODBC")
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("scales")

library(RODBC)
library(dplyr)
library(ggplot2)
library(scales)

# Connect to the database
dbhandle <- odbcDriverConnect('driver={SQL Server};server=MySQLServer;database=TestDB;trusted_connection=true')

# Cast the sampletime as separate date and time columns
res <- sqlQuery(dbhandle, "select CAST(sampletime AS date) AS DateValue,CAST(sampletime AS time) AS TimeValue,physical_name,num_of_bytes_read from virtualfilestats order by physical_name,sampletime")

# Load results into a dplyr data frame
df <- tbl_df(res)

# Cast and format the time column
df$TimeValue <- as.POSIXct(df$TimeValue, format = "%H:%M:%S")

# Group the data frame by file name so the lag function works per file name.
# Also, ensure the proper ordering within each group so the delta values are calculated correctly using lag
df <- df %>% group_by(physical_name) %>% arrange(DateValue, TimeValue)

# Use the lag function to calculate the deltas
df <- df %>% mutate(num_of_bytes_read = num_of_bytes_read - lag(num_of_bytes_read))

# One chart per day with colours representing the different file names
ggplot(df, aes(x = TimeValue, y = num_of_bytes_read, colour = physical_name)) +
  geom_line() +
  facet_grid(~DateValue) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_x_datetime(breaks = date_breaks("1 hour"), labels = date_format("%H:%M")) +
  scale_y_continuous(labels = comma)

# One chart per day per file name. Independent y axis for each file name (scales = "free_y")
ggplot(df, aes(x = TimeValue, y = num_of_bytes_read)) +
  geom_line() +
  facet_grid(physical_name ~ DateValue, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_x_datetime(breaks = date_breaks("1 hour"), labels = date_format("%H:%M")) +
  scale_y_continuous(labels = comma)

# One chart per day per file name. Same scale for every y axis
ggplot(df, aes(x = TimeValue, y = num_of_bytes_read)) +
  geom_line() +
  facet_grid(physical_name ~ DateValue) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_x_datetime(breaks = date_breaks("1 hour"), labels = date_format("%H:%M")) +
  scale_y_continuous(labels = comma)
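The commands above render to the active graphics device. To write a chart to disk for sharing, ggplot2’s ggsave function can be used. The file name and dimensions here are just examples, and a built-in dataset stands in for the DMV data:

```r
library(ggplot2)

# A minimal plot using a built-in dataset, standing in for the charts above
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Write the chart to a PNG file; width and height are in inches by default
ggsave("chart.png", plot = p, width = 12, height = 8)
```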

And here are the three charts that result from the ggplot commands:

The first one shows a chart for each day:


The next two are better. The only difference between them is the independent y axes in the first:



You can see that most reads occur between 7am and 7pm, with more data being read from data files 1 and 2 (perhaps because of proportional fill with different file sizes).

R scripting is great for this type of DBA work. You have full control of the elements that appear on your visualisations using the ggplot2 library. Coupled with this, the dplyr library contains a wealth of windowing functions to deal with the most common data-manipulation tasks faced by DBAs.




Programmatically generating charts

Excel is all well and good when you want to manually plot your data. You manually load your data, create a bar chart, format the axes, change the alignment of labels, add a title and save the result as a file for your team to review. But in an age where everyone is using tools such as PowerShell or Python for their administration tasks, it makes sense to use a scripting tool to generate charts too. After all, as data professionals we all know how to programmatically read from a CSV file or from a SQL Server table. This blog post shows just how easily and elegantly you can generate a publication-quality chart using the ggplot2 library in R.

R is an established data analysis tool. There is a wealth of libraries for all sorts of statistical analysis. Many primitives that exist in SQL Server have counterparts in R. For example, a table can be represented as a data frame. In terms of data frame operations (i.e. selecting, filtering, grouping, ordering and windowing functions), the dplyr library implements most of the SQL-style operations. This blog post does not concern dplyr, but one of its authors (Hadley Wickham) has also authored the ggplot2 library, which implements “the grammar of graphics”. You programmatically generate your chart using the functions within this library. An example making use of the library now follows.

Suppose you have the following CSV file containing summary results for eight different SQL Server configurations. The raw data for each configuration consists of average transactions per second, end-to-end total batch time, disk service time for the transaction log volume, and average seconds per write for the transaction log volume:


You want to visualise the effect of SSD on reducing total batch time compared to the effect of the high performance CPU power plan.
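For reference, the CSV contents correspond to the wide-format data frame shown later, so the file would look something like this (the header names are taken from the code):

```
config,avgtps,totalbatchtime,disktime,avgdisk
instance2,375,7.33,1.27,0.414
instance3highcpuplan,794,4.08,1.00,0.333
instance4,391,8.01,1.28,0.426
instance5,296,10.60,1.76,0.585
instance6,351,8.95,1.99,0.658
instance2highcpuplan,822,4.04,0.9896,0.325
instance1san,589,5.42,1.10,0.365
instance1ssd,939,3.49,0.10,0.038
```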

Assuming you have R and RStudio installed on your workstation this is all the code you need to generate a publication quality chart for the results:

# Install the packages if this is the first time they are being used by the user

# install.packages('dplyr')
# install.packages('ggplot2')
# install.packages('reshape2')

# Use the libraries
library(dplyr)
library(ggplot2)
library(reshape2)

# Read the raw data into a native R data frame
# You could just as easily read the raw data from SQL Server using the RODBC library (see a future blog post)
rawdata <- read.csv("results.csv")  # file name assumed for illustration

# Load the raw data into a dplyr data frame and display the "wide" format results
df <- tbl_df(rawdata)
df

# The df data frame holds its contents in what's known as "wide" format. That is, with columns for the four measures.
# Charting works most naturally with "narrow" format data so we unpivot the results using the reshape2 melt function (author Hadley Wickham again)
plotdata <- melt(df, id = c("config"), measure = c("avgtps", "totalbatchtime", "disktime", "avgdisk"))

# Display the "narrow" format results. This is the format that ggplot will operate on
plotdata
# Plot the results
ggplot(plotdata, aes(x = config, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_text(aes(label = value, y = value, vjust = -1), size = 3)

The output is as follows:

df in “wide” format:

Source: local data frame [8 x 5]

                config avgtps totalbatchtime disktime avgdisk
1            instance2    375           7.33   1.2700   0.414
2 instance3highcpuplan    794           4.08   1.0000   0.333
3            instance4    391           8.01   1.2800   0.426
4            instance5    296          10.60   1.7600   0.585
5            instance6    351           8.95   1.9900   0.658
6 instance2highcpuplan    822           4.04   0.9896   0.325
7         instance1san    589           5.42   1.1000   0.365
8         instance1ssd    939           3.49   0.1000   0.038

plotdata in “narrow” format:

                 config       variable    value
1             instance2         avgtps 375.0000
2  instance3highcpuplan         avgtps 794.0000
3             instance4         avgtps 391.0000
4             instance5         avgtps 296.0000
5             instance6         avgtps 351.0000
6  instance2highcpuplan         avgtps 822.0000
7          instance1san         avgtps 589.0000
8          instance1ssd         avgtps 939.0000
9             instance2 totalbatchtime   7.3300
10 instance3highcpuplan totalbatchtime   4.0800
11            instance4 totalbatchtime   8.0100
12            instance5 totalbatchtime  10.6000
13            instance6 totalbatchtime   8.9500
14 instance2highcpuplan totalbatchtime   4.0400
15         instance1san totalbatchtime   5.4200
16         instance1ssd totalbatchtime   3.4900
17            instance2       disktime   1.2700
18 instance3highcpuplan       disktime   1.0000
19            instance4       disktime   1.2800
20            instance5       disktime   1.7600
21            instance6       disktime   1.9900
22 instance2highcpuplan       disktime   0.9896
23         instance1san       disktime   1.1000
24         instance1ssd       disktime   0.1000
25            instance2        avgdisk   0.4140
26 instance3highcpuplan        avgdisk   0.3330
27            instance4        avgdisk   0.4260
28            instance5        avgdisk   0.5850
29            instance6        avgdisk   0.6580
30 instance2highcpuplan        avgdisk   0.3250
31         instance1san        avgdisk   0.3650
32         instance1ssd        avgdisk   0.0380


The bar charts make use of faceting to show each measure as a separate chart, with “free_y” scales giving each measure an independent y axis. The x axis labels are rotated 90 degrees and each measure value is placed above its bar.


Example ggplot chart


Visualising the results, you can see that although instance1 with SSD had dramatically reduced transaction log latency compared to instance1 on SAN, in terms of reduction of total batch time the bigger “bang for the buck” was seen for instance2 going from a balanced CPU power plan to a high performance CPU power plan. The total batch time went down by almost 3.3 seconds and the average transactions per second more than doubled. This is interesting but is not the point of this blog post.

Using the “grammar of graphics” approach implemented in the ggplot2 library, you have complete programmatic control of the elements you wish to see presented and the style of each element. And the great thing is that you can send not just the image to your end users but also the command you used to generate it. This command is a piece of script, so it can be treated as source code that you can version control and maintain a history for. You can’t do any of this with an Excel-based solution.

You don’t need to be an R coding expert to use the scripting language. The learning curve is far lower than that for PowerShell. And since R is a data language whose operations tend to be vector and matrix operations, it’s a natural fit for professionals who are already familiar with set-based programming in SQL.
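As a rough illustration of that fit, a SQL-style aggregation maps almost clause-for-clause onto a dplyr pipeline. This toy example uses the built-in mtcars dataset rather than anything from this post:

```r
library(dplyr)

# SQL: SELECT cyl, AVG(mpg) AS avgmpg FROM mtcars
#      WHERE hp > 100 GROUP BY cyl ORDER BY avgmpg DESC
result <- mtcars %>%
  filter(hp > 100) %>%               # WHERE
  group_by(cyl) %>%                  # GROUP BY
  summarise(avgmpg = mean(mpg)) %>%  # SELECT with aggregate
  arrange(desc(avgmpg))              # ORDER BY

print(result)
```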

With Microsoft’s acquisition of Revolution Analytics there’s a high chance that R will become more prominent in the toolset that Microsoft provides to SQL Server professionals.

