
Finding the position of a subsequence in a larger sequence

This post is inspired by this article: http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2

In particular, it was inspired by the very elegant solution posted by Pesomannen in the discussion section. I’m going to translate that SQL solution into an equivalent Python pandas solution.

Instead of using integers as the sequence values I’m going to use lower-case letters. Also, rather than searching one column for the required subsequence, I’m going to search two columns for a specified pair of subsequences. The two searched columns are made up of 200,000 random values drawn from the letters a to f. The subsequence I’m going to search for spans both columns and is four rows long: val1 = ('e','a','d','b') and val2 = ('a','d','c','e'). Here’s the solution:


In [1]:import pandas as pd

In [2]:import numpy as np

In [3]:import string

In [4]:letters = np.asarray(list(string.letters), dtype=object)

In [5]:T1 = pd.DataFrame(dict(val1=letters[:6][np.random.randint(0, 6, 200000)],val2=letters[:6][np.random.randint(0, 6, 200000)]))

In [6]:T1 = T1.reset_index()

In [7]:T1[:5]

Out[7]:
      index val1  val2
0     0     e     c
1     1     f     a
2     2     d     f
3     3     c     e
4     4     e     e

I’m showing the first five rows out of a total of 200,000 rows. I’m going to search for this subsequence:


In [8]:P = pd.DataFrame(data = {'val1': ['e','a','d','b'],'val2': ['a','d','c','e']})

In [9]:P = P.reset_index()

In [10]:P

Out[10]:
      index val1  val2
0     0     e     a
1     1     a     d
2     2     d     c
3     3     b     e

The first step is to join the subsequence to the 200,000 rows matching on val1 and val2:


In [11]:J = T1.merge(P,on=['val1','val2'],how='inner')

In [12]:J

Out[12]:
     index_x val1  val2  index_y
0     22     b     e     3
1     41     b     e     3
2     69     b     e     3
3     81     b     e     3
4     221     b     e     3
5     237     b     e     3
6     275     b     e     3
7     348     b     e     3
8     453     b     e     3
9     507     b     e     3

I’m showing the first 10 matching rows.

The elegant SQL solution is based on the fact that if you group by the difference between the matching keys, a subsequence match occurs wherever the group count equals the length of the subsequence. This works because a genuine occurrence starting at position p joins rows p, p+1, p+2 and p+3 to pattern rows 0, 1, 2 and 3 respectively, so all four matched pairs share the same difference p. This translates to the following:

In [13]:FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg(['min','max','count'])

In [14]:FullResult[FullResult['count'] == 4]

Out[14]:
              min        max  count
180767     180767     180770      4
So this result tells us that the subsequence is found at positions 180767 to 180770. Let’s check by interrogating the original 200,000 rows around these key positions:

In [15]:T1[180760:180775]

Out[15]:
           index      val1  val2
180760     180760     a     c
180761     180761     c     f
180762     180762     e     e
180763     180763     d     e
180764     180764     d     b
180765     180765     d     f
180766     180766     e     d
180767     180767     e     a    <----
180768     180768     a     d    <----
180769     180769     d     c    <----
180770     180770     b     e    <----
180771     180771     a     f
180772     180772     e     f
180773     180773     f     c
180774     180774     f     d

 

You see the matching subsequence in positions 180767 to 180770.
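
To see why grouping by the index difference works, here is a minimal pure-Python sketch of the same idea (the toy data and variable names are mine, just for illustration):

from collections import Counter

seq = list('ecafdceab')   # toy sequence to search in
pattern = list('dce')     # toy subsequence to find

# "Join": every (sequence position, pattern position) pair whose values match
matches = [(i, j) for i, v in enumerate(seq)
                  for j, p in enumerate(pattern) if v == p]

# Group by the difference of the positions. A genuine occurrence starting at
# position d contributes exactly len(pattern) matching pairs, all sharing the
# same difference d.
counts = Counter(i - j for i, j in matches)
print([d for d, c in counts.items() if c == len(pattern)])   # [4]

Position 4 is indeed where 'dce' starts within 'ecafdceab'.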

The point of this post is to show that there are tools other than SQL that can be used to solve such puzzles. The Python pandas module provides an SQL-like set of operations for joining data frames. These operations are not that difficult to grasp if you come from an SQL background.


Data analysis operations using the Python pandas module

Within the SQL Server world we tend to use tools such as raw SQL or MDX code, Excel, DAX in Power Pivot and maybe even the M language in Power Query to do most of our data analysis work. Outside of these Microsoft-orientated tools we may even use the R language, which, despite being around for a very long time, is only now being recognised within the SQL Server business intelligence community. In this post I want to mention another data analysis tool which is feature rich, has a large user base (as a result of its parent language) and which can be used to bridge the gap between the Windows and SQL Server world and other worlds that include technologies such as NoSQL. The data analysis tool that I’m referring to is the pandas module used within the Python language.

As is the case for R, the Python language has a very large user base. The Python pandas module provides the DataFrame object, which is analogous to the same object in R. SQL-type operations such as join, merge, group by and pivot can all be performed on pandas DataFrames. I’ll illustrate the use of the pandas module using some very simple examples. I’m using the freely available Anaconda Python 2.7 (64 bit) distribution on Windows. You can obtain this from https://store.continuum.io/cshop/anaconda/. I’m using the interactive IPython qtconsole to run all the code.
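
As a quick flavour of what these operations look like before we get to the real example, here’s a toy join and group by (the data and column names are made up purely for illustration):

import pandas as pd

orders = pd.DataFrame({'cust_id': [1, 1, 2], 'amount': [10.0, 25.0, 7.5]})
customers = pd.DataFrame({'cust_id': [1, 2], 'name': ['alice', 'bob']})

# SQL-style inner join followed by a GROUP BY ... SUM()
joined = orders.merge(customers, on='cust_id', how='inner')
print(joined.groupby('name')['amount'].sum())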

This first section of code shows how to connect to a SQL Server data source:


In [1]: import pandas as pd

In [2]: import pandas.io.sql as psql

In [3]: import pypyodbc

In [4]: conn = pypyodbc.connect('DRIVER={SQL Server};SERVER=mysqlserver;DATABASE=mydatabase;Trusted_Connection=yes')

In [5]: waitstats = psql.read_frame('SELECT sampletime,wait_type,waiting_tasks_count,wait_time_ms FROM dbo.capwaitstats', conn)

In [6]: waitstats.head(10)
Out[6]:
                  sampletime      wait_type  waiting_tasks_count  wait_time_ms
0 2014-03-25 11:19:15.610000  MISCELLANEOUS                    0             0
1 2014-03-25 11:19:15.610000    LCK_M_SCH_S                    0             0
2 2014-03-25 11:19:15.610000    LCK_M_SCH_M                   15             6
3 2014-03-25 11:19:15.610000        LCK_M_S                    5          1029
4 2014-03-25 11:19:15.610000        LCK_M_U                    0             0
5 2014-03-25 11:19:15.610000        LCK_M_X                    0             0
6 2014-03-25 11:19:15.610000       LCK_M_IS                    0             0
7 2014-03-25 11:19:15.610000       LCK_M_IU                    0             0
8 2014-03-25 11:19:15.610000       LCK_M_IX                    0             0
9 2014-03-25 11:19:15.610000      LCK_M_SIU                    0             0

[10 rows x 4 columns]

The pypyodbc module is not included in the Anaconda Python distribution but is very easy to install once Python is present. You simply run this to install it:

pip install pypyodbc

The actual execution above fetches data from an historical waitstats table. This table stores the waitstats in their raw form. That is, just like the DMV that they are sourced from, the counts and times are cumulative figures since the start time of the SQL Server. The head command is used to show the first 10 rows of the dataframe. The first column (showing 0 to 9) represents the default index for the dataframe.
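
As a side note, read_frame has since been deprecated in favour of pandas.read_sql, so assuming a recent pandas version the equivalent fetch would look like this:

# Equivalent call in later pandas versions
waitstats = pd.read_sql('SELECT sampletime, wait_type, waiting_tasks_count, wait_time_ms FROM dbo.capwaitstats', conn)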

You’ll notice that I didn’t specify an ‘ORDER BY wait_type,sampletime’ clause in the SQL that fetches the data from the database. It’s best practice to include such a clause, but if you haven’t, or are unable to do so, you can sort the dataframe as shown next:

In [7]: waitstats.sort(['wait_type','sampletime'], inplace=True)
In [8]: waitstats.set_index(['wait_type','sampletime'],inplace=True)

In [9]: waitstats.ix['OLEDB']
Out[9]:                     waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                22733          1534
2014-03-25 11:19:20.613000                22733          1534
2014-03-25 11:19:25.617000                24257          1677
2014-03-25 11:19:30.620000                24257          1677
2014-03-25 11:19:35.623000                24257          1677
2014-03-25 11:19:40.627000                24257          1677
2014-03-25 11:19:45.630000                24257          1677
2014-03-25 11:19:50.633000                24257          1677
2014-03-25 11:19:55.637000                24257          1677
2014-03-25 11:20:00.640000                25781          1816
2014-03-25 11:20:05.643000                25781          1816
2014-03-25 11:20:10.647000                26641          1876
2014-03-25 11:20:15.650000                26641          1876
2014-03-25 11:20:20.653000                26641          1876
2014-03-25 11:20:25.657000                26641          1876
2014-03-25 11:20:30.660000                26641          1876
2014-03-25 11:20:35.663000                26641          1876
2014-03-25 11:20:40.667000                27435          1935
2014-03-25 11:20:45.670000                27435          1935
2014-03-25 11:20:50.673000                27435          1935

[20 rows x 2 columns]

After the sort I create a multi-index on wait_type and sampletime. I show the use of an index fetch using ix to select out one of the wait types.
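
As an aside, the ix indexer has been deprecated (and eventually removed) in later pandas releases; with a modern pandas the same fetch against the multi-index would be written with loc:

# Modern equivalent of waitstats.ix['OLEDB']
waitstats.loc['OLEDB']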

As mentioned above, the waitstats are cumulative figures. The dataframe will now be grouped by the wait_type index level, and the diff command (the equivalent of lead/lag) will be used to obtain the waitstat deltas:


In [10]: waitstats = waitstats.groupby(level=['wait_type']).diff()

In [11]: waitstats.ix['OLEDB']
Out[11]:
                            waiting_tasks_count  wait_time_ms
sampletime                                                   
2014-03-25 11:19:15.610000                  NaN           NaN
2014-03-25 11:19:20.613000                    0             0
2014-03-25 11:19:25.617000                 1524           143
2014-03-25 11:19:30.620000                    0             0
2014-03-25 11:19:35.623000                    0             0
2014-03-25 11:19:40.627000                    0             0
2014-03-25 11:19:45.630000                    0             0
2014-03-25 11:19:50.633000                    0             0
2014-03-25 11:19:55.637000                    0             0
2014-03-25 11:20:00.640000                 1524           139
2014-03-25 11:20:05.643000                    0             0
2014-03-25 11:20:10.647000                  860            60
2014-03-25 11:20:15.650000                    0             0
2014-03-25 11:20:20.653000                    0             0
2014-03-25 11:20:25.657000                    0             0
2014-03-25 11:20:30.660000                    0             0
2014-03-25 11:20:35.663000                    0             0
2014-03-25 11:20:40.667000                  794            59
2014-03-25 11:20:45.670000                    0             0
2014-03-25 11:20:50.673000                    0             0

[20 rows x 2 columns]

In [12]: waitstats = waitstats.dropna()

You can see that with the assignment statement I’m overwriting the original waitstats dataframe, which held the cumulative figures, with the result of the groupby, which holds just the deltas. The dropna command is used to remove the NaN rows.
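
For those who think in terms of the SQL lag function, the same deltas could have been computed by subtracting a shifted copy. A minimal sketch, assuming cumulative holds the frame as it was before the diff and a pandas version where shift is available on a groupby:

# shift(1) acts like lag over sampletime within each wait_type group
lagged = cumulative.groupby(level='wait_type').shift(1)
deltas = (cumulative - lagged).dropna()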

Now to identify the top five waits based on the sum of the waiting_tasks_counts I use this:


In [13]: topwaits = waitstats.groupby(level=['wait_type'])['waiting_tasks_count'].sum().order(ascending=False)[:5]

In [14]: topwaits
Out[14]:
wait_type
OLEDB               4702
LATCH_EX            1524
ASYNC_NETWORK_IO     914
SLEEP_TASK           505
PAGEIOLATCH_SH       450
Name: waiting_tasks_count, dtype: float64

The wait_type index from the topwaits object can be used to index into the main waitstats dataframe to fetch the full data for the top five waits:


In [15]: topwaits = waitstats.ix[topwaits.index]

In [16]: topwaits.head(10)
Out[16]:
                                             waiting_tasks_count  wait_time_ms
wait_type        sampletime
ASYNC_NETWORK_IO 2014-03-25 11:19:20.613000                    0             0
                 2014-03-25 11:19:25.617000                  182           999
                 2014-03-25 11:19:30.620000                    0             0
                 2014-03-25 11:19:35.623000                  182           871
                 2014-03-25 11:19:40.627000                  182           914
                 2014-03-25 11:19:45.630000                    2             0
                 2014-03-25 11:19:50.633000                    0             0
                 2014-03-25 11:19:55.637000                  182           877
                 2014-03-25 11:20:00.640000                    0             0
                 2014-03-25 11:20:05.643000                    0             0

[10 rows x 2 columns]

Now we’re almost at the end of this analysis. I want to pivot the wait_type and show the waiting_tasks_count as the value. The unstack command is used for this:

In [17]: pivotted = topwaits.unstack(['wait_type'])['waiting_tasks_count']

In [18]: pivotted
Out[18]:
wait_type                  ASYNC_NETWORK_IO LATCH_EX OLEDB PAGEIOLATCH_SH SLEEP_TASK
sampletime
2014-03-25 11:19:20.613000                0        0     0              0         27
2014-03-25 11:19:25.617000              182      649  1524            139         26
2014-03-25 11:19:30.620000                0        0     0              0         28
2014-03-25 11:19:35.623000              182        0     0              0         26
2014-03-25 11:19:40.627000              182        0     0              0         27
2014-03-25 11:19:45.630000                2        0     0              0         26
2014-03-25 11:19:50.633000                0        0     0              0         26
2014-03-25 11:19:55.637000              182        0     0              0         27
2014-03-25 11:20:00.640000                0      642  1524            139         26
2014-03-25 11:20:05.643000                0        0     0              0         27
2014-03-25 11:20:10.647000                0       82   860             99         26
2014-03-25 11:20:15.650000                0        0     0              0         27
2014-03-25 11:20:20.653000                0        0     0              0         26
2014-03-25 11:20:25.657000              182        0     0              0         27
2014-03-25 11:20:30.660000                0        0     0              0         27
2014-03-25 11:20:35.663000                0        0     0              0         26
2014-03-25 11:20:40.667000                0      151   794             73         27
2014-03-25 11:20:45.670000                0        0     0              0         26
2014-03-25 11:20:50.673000                2        0     0              0         27

[19 rows x 5 columns]

And last of all we want to plot the values. The matplotlib module is the one to use here:


In [19]: import matplotlib.pyplot as plt

In [20]: pivotted.plot()
Out[20]: <matplotlib.axes.AxesSubplot at 0xbf81198>

In [21]: plt.show()

Here’s the graph:

[Figure: waitstats, a line chart of the waiting_tasks_count deltas over sampletime for the five top wait types]

There are probably more efficient ways to convert the cumulative stats to deltas than the approach I’ve shown above. However, this isn’t really the point of the blog post. The intention is to show just how much manipulation can be done using the pandas module. I’ve only touched on some of the commands. There is comprehensive documentation at http://pandas.pydata.org/pandas-docs/stable/index.html

In terms of use cases, areas that come to mind are any form of time series analysis, extract/transform/load scenarios, aggregations (e.g. rolling averages, rolling sums) and fast generation of randomised test data. Because of the cross-platform nature of Python it’s ideally suited to the scenario I mentioned at the start of this post. NoSQL databases typically lack some of the features of relational databases. For example, Cassandra doesn’t allow joins between tables or complex aggregations. Architects typically take this into consideration by incorporating denormalisation into the data model and using Hadoop/Hive for complex reporting. With the Python cql module providing the equivalent of ODBC connectivity to Cassandra and the pandas module providing the data analysis function, you can see that what are seen as limitations in the NoSQL database can be overcome by solutions available in the middleware.
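
As a rough sketch of that middleware idea (I’m assuming the DB-API style interface exposed by the cql module here, and the node, keyspace, table and column names are entirely hypothetical):

import cql
import pandas as pd

# Cassandra can't perform this kind of join/aggregation itself, so pull the
# raw rows across into a pandas DataFrame...
conn = cql.connect('mycassandranode', 9160, 'mykeyspace', cql_version='3.0.0')
cursor = conn.cursor()
cursor.execute('SELECT sensor_id, sampletime, reading FROM readings')
df = pd.DataFrame(cursor.fetchall(), columns=[d[0] for d in cursor.description])

# ...and do the aggregation in the pandas middleware instead
print(df.groupby('sensor_id')['reading'].mean())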

If you are a SQL Server person who thinks that the pandas module may be of interest, here is the definitive reference from the author of the module:

http://www.amazon.co.uk/Python-Data-Analysis-Wrangling-IPython/dp/1449319793

The content assumes very little previous knowledge of Python. Prior to dipping into this reference I myself knew hardly any of the language constructs, yet I’m now able to understand many of the questions that arise on forums such as StackOverflow.
