Archive for May, 2014

LAG and LEAD for SQL Server 2008

The LAG and LEAD analytical functions in SQL Server 2012 are very useful. Unfortunately, these functions don’t exist prior to the 2012 version. Most pre-2012 solutions resort to using self joins to identify the lagging or leading row. Here’s a solution that references the source table once:


WITH balance_details
AS
(
SELECT * FROM (
VALUES
('Tom','20140101',100),
('Tom','20140102',120),
('Tom','20140103',150),
('Tom','20140104',140),
('Tom','20140105',160),
('Tom','20140106',180),
('Jerry','20140101',210),
('Jerry','20140102',240),
('Jerry','20140103',230),
('Jerry','20140104',270),
('Jerry','20140105',190),
('Jerry','20140106',200),
('David','20140101',170),
('David','20140102',230),
('David','20140103',240),
('David','20140104',210),
('David','20140105',160),
('David','20140106',200)
) AS t (customer, balancedate,balance)
),
balance_cte AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate) rn
,(ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate))/2 rndiv2
,(ROW_NUMBER() OVER (PARTITION BY customer ORDER BY balancedate) + 1)/2 rnplus1div2
/*
,COUNT(*) OVER (PARTITION BY customer) partitioncount
*/
,customer
,balancedate
,balance
FROM balance_details
)
SELECT
rn
,rndiv2
,rnplus1div2
,customer,balancedate,balance
,CASE WHEN rn%2=1 THEN MAX(CASE WHEN rn%2=0 THEN balance END) OVER (PARTITION BY customer,rndiv2) ELSE MAX(CASE WHEN rn%2=1 THEN balance END) OVER (PARTITION BY customer,rnplus1div2)
END AS balance_lag
,CASE WHEN rn%2=1 THEN MAX(CASE WHEN rn%2=0 THEN balance END) OVER (PARTITION BY customer,rnplus1div2) ELSE MAX(CASE WHEN rn%2=1 THEN balance END) OVER (PARTITION BY customer,rndiv2)
END AS balance_lead
/*
,MAX(CASE WHEN rn=1 THEN balance END) OVER (PARTITION BY customer) AS first_value
,MAX(CASE WHEN rn=partitioncount THEN balance END) OVER (PARTITION BY customer) AS last_value
,MAX(CASE WHEN rn=4 THEN balance END) OVER (PARTITION BY customer) AS fourth_value
*/
FROM balance_cte
ORDER BY customer,balancedate

The results are as follows:

laglead

(The commented out code shows how to obtain the equivalent of the SQL 2012 FIRST_VALUE and LAST_VALUE. I also show how to obtain the Nth value where I’ve specified N as 4)

Regarding the LAG code it relies on the fact that integer division by 2 of the ROW_NUMBER can be used in the partitioning function. For example, 2 and 3 div 2 give 1. When the the current row is 3 (ie rn mod 2 is 1) we examine the window frame rows where rndiv2 is 1 and choose the balance value where the frame row is even (ie rn mod 2 is 0). That is, the value for row 2. When the current row is 2 (ie rn mod 2 is 0) we examine the window frame rows where rnplus1div2 is 1 and choose the balance value where the frame row is odd (ie rn mod 2 is 1). That is, the value for row 1.

For LEAD the priniciple is the same, we just use the opposite ‘division by 2’ expression in the partition clause.

Leave a comment

Finding the position of a subsequence in a larger sequence

This post is inspired by this article: http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2

Particularly the very elegant solution posted by Pesomannen in the Discussion section. I’m going to translate this SQL solution into an equivalent Python pandas solution.

Instead of using integers as the sequence values I’m going to use lower case letters. Also, rather than having one column in which I search for the required subsequence I’m going to search in two columns for a specified pair of subsequences. The two searched columns are made up of 200,000 random values consisting of the letters a to f. The subsequence I’m going to search for are two columns of four letters. Specifically, the two subsequences made up of val1 =  (‘e’,’a’,’d’,’b’) and val2 = (‘a’,’d’,’c’,’e’). Here’s the solution:


In [1]:import pandas as pd

In [2]:import numpy as np

In [3]:import string

In [4]:letters = np.asarray(list(string.letters), dtype=object)

In [5]:T1 = pd.DataFrame(dict(val1=letters[:6][np.random.randint(0, 6, 200000)],val2=letters[:6][np.random.randint(0, 6, 200000)]))

In [6]:T1 = T1.reset_index()

In [7]:T1[:5]

Out[7]:
      index val1  val2
0     0     e     c
1     1     f     a
2     2     d     f
3     3     c     e
4     4     e     e

I’m showing the first five rows out of a total of 200,000 rows. I’m going to search for this subsequence:


In [8]:P = pd.DataFrame(data = {'val1': ['e','a','d','b'],'val2': ['a','d','c','e']})

In [9]:P = P.reset_index()

In [10]:P

Out[10]:
      index val1  val2
0     0     e     a
1     1     a     d
2     2     d     c
3     3     b     e

The first step is to join the subsequence to the 200,000 rows matching on val1 and val2:


In [11]:J = T1.merge(P,on=['val1','val2'],how='inner')

In [12]:J

Out[12]:
     index_x val1  val2  index_y
0     22     b     e     3
1     41     b     e     3
2     69     b     e     3
3     81     b     e     3
4     221     b     e     3
5     237     b     e     3
6     275     b     e     3
7     348     b     e     3
8     453     b     e     3
9     507     b     e     3

I’m showing the first 10 matching rows.
Now the elegant SQL solution is based on the fact that if you group by the difference in the matching keys the subsequence matches occur when the group by counts equal the length of the subsequence. This translates to the following:
In [13]:FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg({min,max,'count'})

In [14]:FullResult[FullResult['count'] == 4]

Out[14]:
           count min        max
180767     4     180767     180770
So this result tells up that the subsequence is found in positions 180767 to 180770. Let’s check but interrogating the original 200,000 rows between these key positions:
In [15]:T1[180760:180775]

Out[15]:
           index      val1  val2
180760     180760     a     c
180761     180761     c     f
180762     180762     e     e
180763     180763     d     e
180764     180764     d     b
180765     180765     d     f
180766     180766     e     d
180767     180767     e     a    <----
180768     180768     a     d    <----
180769     180769     d     c    <----
180770     180770     b     e    <----
180771     180771     a     f
180772     180772     e     f
180773     180773     f     c
180774     180774     f     d

 

You see the matching subsequence in positions 180767 to 180770.

The point of this post is to show there are tools other that SQL that can be used to solve such puzzles. The Python pandas module provides an SQL-like set of operations for joining data frames. These operations are not that difficult to grasp if you come from an SQL background.

Leave a comment