Thursday, September 29, 2022

Non Consecutive Islands

There are a lot of different posts out there on solving the "gaps and islands" problem, but I recently struck a slight variation that caused quite a bit of thinking, searching and more thinking.

The problem I had related to creating a unique timeline view of data from temporal (history) tables, but in this case the values I needed to display would revert to previous values, and I needed to retain that sequence. Here's a simple example:

sample data


You can see the Manager value change from Sam to Mike and then back to Sam for Bobs Bikes, and from Mandy to Sam and back to Mandy again for Mandy's Mowers. The output I needed was:

The solutions I initially thought of, and those I found in further searches, would group the managers together and lose the sequencing. Then I stumbled across a post (which I can't re-find) which, while not exactly what I wanted, set me on the right track. Here's what I came up with:

First up was to union the history table and current/live table data together into a temp table and add 3 extra columns to the temp table (there's a rough sketch of this setup after the list):
  • a column to hold a hash of the row data I wanted to compare. This was across multiple columns, so using a hash made it easier than trying to compare each column multiple times
  • a numeric column to hold a unique identity id for each row
  • a sequence start column, which I'll explain shortly. I also added a sequence end column but ended up not using it in the final solution
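For context, here's a rough sketch of that first step. The live and history table names (dbo.Company and dbo.CompanyHistory) and the exact column list are assumptions for illustration only, but the three extra columns match the ones used in the queries below.

-- sketch only: dbo.CompanyHistory / dbo.Company are assumed table names
select CompanyName, Manager, ValidFrom, ValidTo,
    cast(null as varbinary(32)) as sha,       -- hash of the columns to compare
    cast(null as int)           as rn,        -- unique id per row
    cast(null as int)           as seq_start  -- 1 = row starts a sequence
into #t
from dbo.CompanyHistory
union all
select CompanyName, Manager, ValidFrom, ValidTo, null, null, null
from dbo.Company;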
Using the above sample data, the temp table with the generated hash and id now looks like this.
 
with cte as (
    select *,
        -- hash of the columns being compared (CompanyName and Manager here)
        hashbytes('sha2_256', concat(CompanyName, '|', Manager)) as sha2,
        -- unique id for each row, in company and date order
        row_number() over (order by CompanyName, ValidFrom) as rm
    from #t
)
update cte set rn = rm, sha = sha2;
 

Next step is to flag the rows that start a sequence. To do this I compared the hash for each row with the hash of the previous row; if they differed (or there was no previous row), the sequence start column was set to 1 to flag that the row was the start of a sequence. The main goal here is to build a genuine chronological sequence, as the same hash may appear at different points in the timeline.
 
with cte as (
    select t1.*,
        -- t2 is the previous row for the same company (t2.rn = t1.rn - 1);
        -- a different hash, or no previous row, marks the start of a sequence
        case when t1.sha <> t2.sha or t2.sha is null then 1 else 0 end as is_start
    from #t t1
    left join #t t2 on t1.CompanyName = t2.CompanyName and t1.rn = t2.rn + 1
)
update cte set seq_start = is_start;
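
As an aside, the same flag could also be derived without the self-join by using LAG to fetch the previous row's hash within each company; a sketch of that equivalent variant:

-- alternative sketch: LAG instead of a self-join to reach the previous row's hash
with cte as (
    select *,
        case
            when lag(sha) over (partition by CompanyName order by rn) is null
              or lag(sha) over (partition by CompanyName order by rn) <> sha
            then 1 else 0
        end as is_start
    from #t
)
update cte set seq_start = is_start;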


 
Now for the real magic: using a SUM window function with a rows-based frame, I could assign a group number to each sequence, per company.
 
select *,
    -- running total of the start flags = group number for each island, per company
    sum(seq_start) over (partition by CompanyName order by ValidFrom rows unbounded preceding) as grp
from #t


And from there it was straightforward to grab the first and last date values to build the final output. One note here though: LAST_VALUE needs the ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING frame to work correctly, because the default frame ends at the current row and would otherwise just return the current row's ValidTo.

with cte as (
    select *,
        sum(seq_start) over (partition by CompanyName order by ValidFrom rows unbounded preceding) as grp
    from #t
)
select distinct CompanyName, Manager,
    -- first and last dates within each island (grp) for each company
    FIRST_VALUE(ValidFrom) over (partition by CompanyName, grp order by ValidFrom) as ValidFrom,
    LAST_VALUE(ValidTo) over (partition by CompanyName, grp order by ValidTo
                              rows between current row and unbounded following) as ValidTo
from cte
order by CompanyName, ValidFrom
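
As another aside, the DISTINCT plus window functions in the final select could be swapped for a GROUP BY on the island number; a sketch of that equivalent variant:

with cte as (
    select *,
        sum(seq_start) over (partition by CompanyName order by ValidFrom rows unbounded preceding) as grp
    from #t
)
select CompanyName, Manager,
    min(ValidFrom) as ValidFrom,  -- start of the island
    max(ValidTo)   as ValidTo     -- end of the island
from cte
group by CompanyName, Manager, grp
order by CompanyName, min(ValidFrom);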



I was pretty happy with how this worked in the end. We decided to rebuild the history table, rather than have this as a live query, so performance didn't need to be super quick, but it did actually perform a lot better than expected. Updating the history table had a couple of quirks, but that's a story for another day.


