Excel is all well and good when you want to plot your data manually: you load the data, create a bar chart, format the axes, change the alignment of the labels, add a title and save the result as a file for your team to review. But in an age where everyone is using tools such as PowerShell or Python for their administration tasks, it makes sense to use a scripting tool to generate charts too. After all, as data professionals we all know how to programmatically read from a CSV file or from a SQL Server table. This blog post shows just how easily and elegantly you can generate a publication-quality chart using the ggplot2 library in R.
R is an established data analysis tool. There is a wealth of libraries for all sorts of statistical analysis. Many primitives that exist in SQL Server have counterparts in R; a table, for example, can be represented as a data frame. For data frame operations (i.e. selecting, filtering, grouping, ordering and windowing functions), the dplyr library implements most of the SQL-style operations; a small sketch of what that looks like follows below. This blog post does not concern dplyr, but one of its authors (Hadley Wickham) has also authored the ggplot2 library, which implements “the grammar of graphics”. You programmatically generate your chart using the functions within this library. An example making use of ggplot2 follows the dplyr sketch.
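To give a flavour of those SQL-style verbs, here is a minimal dplyr sketch. It uses R's built-in mtcars data set purely for illustration; it is not part of the original example and has nothing to do with the data used later in this post.

# Roughly equivalent to:
#   SELECT gear, COUNT(*) AS n, AVG(mpg) AS avg_mpg
#   FROM mtcars
#   WHERE cyl >= 6
#   GROUP BY gear
#   ORDER BY avg_mpg DESC
library(dplyr)
mtcars %>%
  filter(cyl >= 6) %>%                          # WHERE
  group_by(gear) %>%                            # GROUP BY
  summarise(n = n(), avg_mpg = mean(mpg)) %>%   # aggregates
  arrange(desc(avg_mpg))                        # ORDER BY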
Suppose you have the following CSV file containing summary results for eight different SQL Server configurations. The raw data for each configuration consists of average transactions per second, end-to-end total batch time, disk service time for the transaction log volume and average seconds per write for the transaction log volume:
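The original file isn't reproduced here, but based on the wide-format data frame shown later in the post it would look something like this (saved as c:\temp\data.csv, the path the script below reads from):

config,avgtps,totalbatchtime,disktime,avgdisk
instance2,375,7.33,1.27,0.414
instance3highcpuplan,794,4.08,1.00,0.333
instance4,391,8.01,1.28,0.426
instance5,296,10.60,1.76,0.585
instance6,351,8.95,1.99,0.658
instance2highcpuplan,822,4.04,0.9896,0.325
instance1san,589,5.42,1.10,0.365
instance1ssd,939,3.49,0.10,0.038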
You want to visualise the effect of SSD on reducing total batch time compared to the effect of the high performance CPU power plan.
Assuming you have R and RStudio installed on your workstation, this is all the code you need to generate a publication-quality chart for the results:
# Install the packages if this is the first time they are being used by the user
# install.packages('dplyr')
# install.packages('ggplot2')
# install.packages('reshape2')

# Use the libraries
library(dplyr)
library(ggplot2)
library(reshape2)

# Read the raw data into a native R data frame
# You could just as easily read the raw data from SQL Server using the RODBC library (see a future blog post)
rawdata <- read.csv('c:\\temp\\data.csv')

# Load the raw data into a dplyr data frame and display the "wide" format results
df <- tbl_df(rawdata)
df

# The df data frame holds its contents in what's known as "wide" format. That is, with columns for the four measures.
# Charting works most naturally with "narrow" format data so we unpivot the results using the reshape2 melt function (author Hadley Wickham again)
plotdata <- melt(df, id = c("config"), measure = c("avgtps", "totalbatchtime", "disktime", "avgdisk"))

# Display the "narrow" format results. This is the format that ggplot will operate on
plotdata

# Plot the results
ggplot(plotdata, aes(x = config, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_text(aes(label = value, y = value, vjust = -1), size = 3)
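As the comment in the script notes, the CSV read could just as easily be replaced with a query against SQL Server via the RODBC library. A minimal sketch follows; the connection string, database name and table name are placeholders for illustration only, not part of the original post:

# Hypothetical RODBC equivalent of the read.csv call above
library(RODBC)
ch <- odbcDriverConnect('driver={SQL Server};server=MYSERVER;database=PerfTests;trusted_connection=yes')
rawdata <- sqlQuery(ch, "SELECT config, avgtps, totalbatchtime, disktime, avgdisk FROM dbo.ConfigResults")
odbcClose(ch)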
The output is as follows:
df in “wide” format:
Source: local data frame [8 x 5]

                config avgtps totalbatchtime disktime avgdisk
1            instance2    375           7.33   1.2700   0.414
2 instance3highcpuplan    794           4.08   1.0000   0.333
3            instance4    391           8.01   1.2800   0.426
4            instance5    296          10.60   1.7600   0.585
5            instance6    351           8.95   1.9900   0.658
6 instance2highcpuplan    822           4.04   0.9896   0.325
7         instance1san    589           5.42   1.1000   0.365
8         instance1ssd    939           3.49   0.1000   0.038
plotdata in “narrow” format:
                 config       variable    value
 1            instance2         avgtps 375.0000
 2 instance3highcpuplan         avgtps 794.0000
 3            instance4         avgtps 391.0000
 4            instance5         avgtps 296.0000
 5            instance6         avgtps 351.0000
 6 instance2highcpuplan         avgtps 822.0000
 7         instance1san         avgtps 589.0000
 8         instance1ssd         avgtps 939.0000
 9            instance2 totalbatchtime   7.3300
10 instance3highcpuplan totalbatchtime   4.0800
11            instance4 totalbatchtime   8.0100
12            instance5 totalbatchtime  10.6000
13            instance6 totalbatchtime   8.9500
14 instance2highcpuplan totalbatchtime   4.0400
15         instance1san totalbatchtime   5.4200
16         instance1ssd totalbatchtime   3.4900
17            instance2       disktime   1.2700
18 instance3highcpuplan       disktime   1.0000
19            instance4       disktime   1.2800
20            instance5       disktime   1.7600
21            instance6       disktime   1.9900
22 instance2highcpuplan       disktime   0.9896
23         instance1san       disktime   1.1000
24         instance1ssd       disktime   0.1000
25            instance2        avgdisk   0.4140
26 instance3highcpuplan        avgdisk   0.3330
27            instance4        avgdisk   0.4260
28            instance5        avgdisk   0.5850
29            instance6        avgdisk   0.6580
30 instance2highcpuplan        avgdisk   0.3250
31         instance1san        avgdisk   0.3650
32         instance1ssd        avgdisk   0.0380
The bar charts make use of the facet_wrap command to show each measure as a separate chart, with “free_y” scales giving each facet its own independent y axis. The x axis labels are rotated 90 degrees and each measure value is placed above its bar.
Visualising the results, you can see that although instance1 with SSD dramatically reduced transaction log latency compared to instance1 on SAN, the bigger “bang for the buck” in terms of reducing total batch time came from taking instance2 from a balanced CPU power plan to a high performance CPU power plan: total batch time went down by almost 3.3 seconds and average transactions per second more than doubled. This is interesting but is not the point of this blog post.
Using the “grammar of graphics” approach implemented in the ggplot2 library, you have complete programmatic control over the elements you wish to see presented and the style of each element. And the great thing is that you can send your end users not just the image but also the command used to generate it. That command is a piece of script, so it can be treated as source code that you can keep under version control with a full history. You can’t do any of this with an Excel-based solution.
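For example, a title, a different built-in theme or an image file for distribution are each just one more line of the same script. A small sketch building on the plot above; the title text and output path are illustrative only:

# Assign the plot to a variable so it can be tweaked and saved
p <- ggplot(plotdata, aes(x = config, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free_y", ncol = 1) +
  theme_bw() +                                                            # switch to a different built-in theme
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("SQL Server configuration comparison")                          # illustrative title

# Write the chart to disk so the image can be distributed alongside the script
ggsave("c:\\temp\\config_comparison.png", p, width = 8, height = 10)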
You don’t need to be an R coding expert to use the scripting language; the learning curve is far lower than that for PowerShell. And since R is a data language whose operations tend to be vector and matrix operations, it’s a natural fit for professionals who are already familiar with set-based programming in SQL.
With Microsoft’s acquisition of Revolution Analytics there’s a high chance that R will become more prominent in the toolset that Microsoft provides to SQL Server professionals.