Programmatically generating charts

Excel is all well and good when you want to plot your data manually: you load the data, create a bar chart, format the axes, change the alignment of the labels, add a title and save the result as a file for your team to review. But in an age where everyone is using tools such as PowerShell or Python for their administration tasks, it makes sense to use a scripting tool to generate charts too. After all, as data professionals we all know how to programmatically read from a CSV file or a SQL Server table. This blog post shows just how easily and elegantly you can generate a publication-quality chart using the ggplot2 library in R.

R is an established data analysis tool, and there is a wealth of libraries for all sorts of statistical analysis. Many primitives that exist in SQL Server have counterparts in R; for example, a table can be represented as a data frame. For data frame operations (i.e. selecting, filtering, grouping, ordering and windowing), the dplyr library implements most of the SQL-style operations. This blog post does not concern dplyr, but one of its authors (Hadley Wickham) has also authored the ggplot2 library, which implements “the grammar of graphics”: you generate your chart programmatically using the functions within this library. An example making use of ggplot2 follows below.
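First, though, to make the SQL analogy concrete, here is a hedged sketch of how a familiar query maps onto dplyr verbs. It uses the df data frame built later in this post (columns config and avgtps); dplyr is not otherwise needed for the chart:

# SELECT config, avgtps FROM df WHERE avgtps > 500 ORDER BY avgtps DESC
library(dplyr)
df %>%
  select(config, avgtps) %>%   # SELECT column list
  filter(avgtps > 500) %>%     # WHERE predicate
  arrange(desc(avgtps))        # ORDER BY ... DESC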

Suppose you have the following CSV file containing summary results for eight different SQL Server configurations. The raw data for each configuration consists of average transactions per second, end-to-end total batch time, disk service time for the transaction log volume and average seconds per write for the transaction log volume:

config,avgtps,totalbatchtime,disktime,avgdisk
instance2,375,7.33,1.27,0.414
instance3highcpuplan,794,4.08,1.00,0.333
instance4,391,8.01,1.28,0.426
instance5,296,10.6,1.76,0.585
instance6,351,8.95,1.99,0.658
instance2highcpuplan,822,4.04,0.9896,0.325
instance1san,589,5.42,1.10,0.365
instance1ssd,939,3.49,0.10,0.038

You want to visualise the effect of the SSD in reducing total batch time compared to the effect of the high performance CPU power plan.

Assuming you have R and RStudio installed on your workstation, this is all the code you need to generate a publication-quality chart of the results:

# Install the packages if this is the first time they are being used by the user

# install.packages('dplyr')
# install.packages('ggplot2')
# install.packages('reshape2')

# Load the libraries
library(dplyr)
library(ggplot2)
library(reshape2)

# Read the raw data into a native R data frame
# You could just as easily read the raw data from SQL Server using the RODBC library (see a future blog post)
rawdata <- read.csv('c:\\temp\\data.csv')
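
# As a hedged sketch of that RODBC alternative (the connection string and
# table name here are hypothetical):
# library(RODBC)
# ch <- odbcDriverConnect('driver={SQL Server};server=localhost;database=perfdb;trusted_connection=yes')
# rawdata <- sqlQuery(ch, 'SELECT config, avgtps, totalbatchtime, disktime, avgdisk FROM dbo.results')
# odbcClose(ch)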

# Load the raw data into a dplyr data frame and display the "wide" format results
df <- tbl_df(rawdata)
df

# The df data frame holds its contents in what's known as "wide" format, that is, with a column for each of the four measures.
# Charting works most naturally with "narrow" format data, so we unpivot the results using the reshape2 melt function (authored by Hadley Wickham again)
plotdata <- melt(df, id = c("config"), measure = c("avgtps", "totalbatchtime", "disktime", "avgdisk"))

# Display the "narrow" format results. This is the format that ggplot will operate on
plotdata

# Plot the results
ggplot(plotdata, aes(x = config, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_text(aes(label = value), vjust = -1, size = 3)

The output is as follows:

df in “wide” format:

Source: local data frame [8 x 5]

                config avgtps totalbatchtime disktime avgdisk
1            instance2    375           7.33   1.2700   0.414
2 instance3highcpuplan    794           4.08   1.0000   0.333
3            instance4    391           8.01   1.2800   0.426
4            instance5    296          10.60   1.7600   0.585
5            instance6    351           8.95   1.9900   0.658
6 instance2highcpuplan    822           4.04   0.9896   0.325
7         instance1san    589           5.42   1.1000   0.365
8         instance1ssd    939           3.49   0.1000   0.038

plotdata in “narrow” format:

                 config       variable    value
1             instance2         avgtps 375.0000
2  instance3highcpuplan         avgtps 794.0000
3             instance4         avgtps 391.0000
4             instance5         avgtps 296.0000
5             instance6         avgtps 351.0000
6  instance2highcpuplan         avgtps 822.0000
7          instance1san         avgtps 589.0000
8          instance1ssd         avgtps 939.0000
9             instance2 totalbatchtime   7.3300
10 instance3highcpuplan totalbatchtime   4.0800
11            instance4 totalbatchtime   8.0100
12            instance5 totalbatchtime  10.6000
13            instance6 totalbatchtime   8.9500
14 instance2highcpuplan totalbatchtime   4.0400
15         instance1san totalbatchtime   5.4200
16         instance1ssd totalbatchtime   3.4900
17            instance2       disktime   1.2700
18 instance3highcpuplan       disktime   1.0000
19            instance4       disktime   1.2800
20            instance5       disktime   1.7600
21            instance6       disktime   1.9900
22 instance2highcpuplan       disktime   0.9896
23         instance1san       disktime   1.1000
24         instance1ssd       disktime   0.1000
25            instance2        avgdisk   0.4140
26 instance3highcpuplan        avgdisk   0.3330
27            instance4        avgdisk   0.4260
28            instance5        avgdisk   0.5850
29            instance6        avgdisk   0.6580
30 instance2highcpuplan        avgdisk   0.3250
31         instance1san        avgdisk   0.3650
32         instance1ssd        avgdisk   0.0380

The bar charts use facet_wrap to show each measure as a separate chart, with “free_y” scales giving each facet an independent y axis. The x axis labels are rotated 90 degrees and each measure's value is placed above its bar.

Example ggplot chart

Visualising the results, you can see that although instance1 with SSD had dramatically reduced transaction log latency compared to instance1 on SAN, in terms of reducing total batch time the bigger “bang for the buck” was seen for instance2 going from a balanced CPU power plan to a high performance CPU power plan: total batch time went down by almost 3.3 seconds and average transactions per second more than doubled. This is interesting, but it is not the point of this blog post.
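Those two numbers can be checked straight from the data frame rather than read off the chart. A minimal sketch using base R subsetting on the rawdata frame loaded earlier:

# Reduction in total batch time for instance2 after the power plan change
with(rawdata, totalbatchtime[config == 'instance2'] -
     totalbatchtime[config == 'instance2highcpuplan'])   # 3.29 seconds

# Ratio of average transactions per second after versus before the change
with(rawdata, avgtps[config == 'instance2highcpuplan'] /
     avgtps[config == 'instance2'])                      # ~2.19, i.e. more than double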

Using the “grammar of graphics” approach implemented in the ggplot2 library, you have complete programmatic control over the elements you wish to see presented and the style of each element. And the great thing is that you can send your end users not just the image but also the command used to generate it. That command is a piece of script, so it can be treated as source code that you can version control and keep a history of. You can't do any of this with an Excel-based solution.
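To illustrate that programmatic control, adding a title and writing the image to disk are each a single line. A minimal sketch (the title text, file name and dimensions are just examples):

# Add a title to the most recently drawn plot and save it as a PNG
p <- last_plot() + labs(title = 'SQL Server configuration comparison')
ggsave('c:\\temp\\results.png', plot = p, width = 6, height = 9, dpi = 300)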

You don't need to be an R coding expert to use the scripting language: the learning curve is far lower than that of PowerShell. And since R is a data language whose operations tend to be vector and matrix operations, it's a natural fit for professionals who are already familiar with set-based programming in SQL.

With Microsoft's acquisition of Revolution Analytics, there's a high chance that R will become more prominent in the toolset that Microsoft provides to SQL Server professionals.

 
