Finding the position of a subsequence in a larger sequence

This post is inspired by this article:

Particularly the very elegant solution posted by Pesomannen in the Discussion section. I’m going to translate this SQL solution into an equivalent Python pandas solution.

Instead of using integers as the sequence values I’m going to use lower case letters. Also, rather than having one column in which I search for the required subsequence I’m going to search in two columns for a specified pair of subsequences. The two searched columns are made up of 200,000 random values consisting of the letters a to f. The subsequence I’m going to search for are two columns of four letters. Specifically, the two subsequences made up of val1 =  (‘e’,’a’,’d’,’b’) and val2 = (‘a’,’d’,’c’,’e’). Here’s the solution:

In [1]:import pandas as pd

In [2]:import numpy as np

In [3]:import string

In [4]:letters = np.asarray(list(string.letters), dtype=object)

In [5]:T1 = pd.DataFrame(dict(val1=letters[:6][np.random.randint(0, 6, 200000)],val2=letters[:6][np.random.randint(0, 6, 200000)]))

In [6]:T1 = T1.reset_index()

In [7]:T1[:5]

      index val1  val2
0     0     e     c
1     1     f     a
2     2     d     f
3     3     c     e
4     4     e     e

I’m showing the first five rows out of a total of 200,000 rows. I’m going to search for this subsequence:

In [8]:P = pd.DataFrame(data = {'val1': ['e','a','d','b'],'val2': ['a','d','c','e']})

In [9]:P = P.reset_index()

In [10]:P

      index val1  val2
0     0     e     a
1     1     a     d
2     2     d     c
3     3     b     e

The first step is to join the subsequence to the 200,000 rows matching on val1 and val2:

In [11]:J = T1.merge(P,on=['val1','val2'],how='inner')

In [12]:J

     index_x val1  val2  index_y
0     22     b     e     3
1     41     b     e     3
2     69     b     e     3
3     81     b     e     3
4     221     b     e     3
5     237     b     e     3
6     275     b     e     3
7     348     b     e     3
8     453     b     e     3
9     507     b     e     3

I’m showing the first 10 matching rows.
Now the elegant SQL solution is based on the fact that if you group by the difference in the matching keys the subsequence matches occur when the group by counts equal the length of the subsequence. This translates to the following:
In [13]:FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg({min,max,'count'})

In [14]:FullResult[FullResult['count'] == 4]

           count min        max
180767     4     180767     180770
So this result tells up that the subsequence is found in positions 180767 to 180770. Let’s check but interrogating the original 200,000 rows between these key positions:
In [15]:T1[180760:180775]

           index      val1  val2
180760     180760     a     c
180761     180761     c     f
180762     180762     e     e
180763     180763     d     e
180764     180764     d     b
180765     180765     d     f
180766     180766     e     d
180767     180767     e     a    <----
180768     180768     a     d    <----
180769     180769     d     c    <----
180770     180770     b     e    <----
180771     180771     a     f
180772     180772     e     f
180773     180773     f     c
180774     180774     f     d


You see the matching subsequence in positions 180767 to 180770.

The point of this post is to show there are tools other that SQL that can be used to solve such puzzles. The Python pandas module provides an SQL-like set of operations for joining data frames. These operations are not that difficult to grasp if you come from an SQL background.
  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: