Category: tech

Calculating dynamically anchored windows of data using Recursive CTE in Amazon Redshift

A common problem in data warehousing and analytics is determining the value of a column relative to a previous record in a partition. For example, you may want to group user events into sessions separated by at least 24 hours. In this case you can simply use a regular common table expression (CTE) and the LAG() window function to find the previous timestamp for each user event, then check whether the time difference is less than 24 hours. Let's take this dataset as input, for example:

The last column is the desired output: a boolean indicating whether we want to keep the record. Here is the SQL we can use to achieve it:
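A runnable sketch of this approach, using SQLite in place of Redshift so it can be executed locally (julianday() stands in for Redshift's DATEDIFF, and the sample events are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2023-01-01 00:00"),
        (1, "2023-01-01 12:00"),
        (1, "2023-01-03 00:00"),
        (1, "2023-01-03 12:00"),
        (1, "2023-01-04 06:00"),
    ],
)

rows = conn.execute("""
    WITH with_prev AS (
        SELECT user_id, ts,
               LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS prev_ts
        FROM events
    )
    SELECT user_id, ts,
           -- keep the record if there is no previous event, or if the gap
           -- from the previous event exceeds 24 hours (1 Julian day)
           prev_ts IS NULL OR julianday(ts) - julianday(prev_ts) > 1 AS keep
    FROM with_prev
    ORDER BY user_id, ts
""").fetchall()

for user_id, ts, keep in rows:
    print(user_id, ts, bool(keep))
```

Only the first event and the one more than 24 hours after its immediate predecessor are kept; every comparison is against the directly preceding row.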

Pretty simple, right? But what if you want to calculate the value based not on the previous record but on the first one, and then repeat the same approach for the remaining records within each partition? For example, 24 hours since the first user event. Here is the same dataset, but with the last column changed to reflect the new requirement:

Notice that rn 11 is set to be ignored in the first dataset, but under the new requirements we want to keep it because its ts is more than 24 hours after the last record for user_id 1 with keep IS TRUE (rn 9). We can't use the LAG() function in this case anymore.

This is where a FOR loop or recursion can help us. In Amazon Redshift, the recursive CTE has become available recently. Most examples demonstrate its usage with hierarchical data, but our use case is quite different.

First, we need to number the rows with the ROW_NUMBER() window function for each user before we pass the data into the recursion, because window and aggregate functions are not supported inside it. This produces a chronologically numbered column that starts at 1 for each user's event records.
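A minimal sketch of that numbering step, again run against SQLite (the `events` table name and sample rows are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2023-01-01 12:00"), (1, "2023-01-01 00:00"), (2, "2023-01-02 00:00")],
)

rows = conn.execute("""
    -- number each user's events chronologically, restarting at 1 per user
    SELECT user_id, ts,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS rn
    FROM events
    ORDER BY user_id, rn
""").fetchall()

print(rows)
```

User 1's two events get rn 1 and 2 in timestamp order regardless of insertion order, and the numbering restarts at 1 for user 2.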

Then we can reference it in our recursive CTE. The first part of the CTE (before the UNION ALL) identifies the anchor event for each user. The second part matches the following event by joining to itself (the recursion) on the next row number. If the next event is not more than 24 hours after the current anchor, it passes the previous anchor date on to the next iteration:
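A runnable sketch of the recursive query, using SQLite so it executes locally; Redshift syntax differs mainly in the date arithmetic (DATEDIFF instead of julianday) and the sample data is invented. The last event is the interesting one: it falls less than 24 hours after its immediate predecessor but more than 24 hours after the last kept event, so it opens a new window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2023-01-01 00:00"),
        (1, "2023-01-01 12:00"),
        (1, "2023-01-03 00:00"),
        (1, "2023-01-03 12:00"),
        (1, "2023-01-04 06:00"),
    ],
)

rows = conn.execute("""
    WITH RECURSIVE numbered AS (
        -- ROW_NUMBER() must be computed before the recursion starts,
        -- since window functions are not allowed inside it
        SELECT user_id, ts,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS rn
        FROM events
    ),
    anchored AS (
        -- anchor: the first event per user is always kept and opens a window
        SELECT user_id, ts, rn, ts AS anchor_ts, 1 AS keep
        FROM numbered
        WHERE rn = 1

        UNION ALL

        -- step: fetch the next event by row number; if it is more than
        -- 24 hours past the current anchor, keep it and move the anchor,
        -- otherwise carry the previous anchor forward
        SELECT n.user_id, n.ts, n.rn,
               CASE WHEN julianday(n.ts) - julianday(a.anchor_ts) > 1
                    THEN n.ts ELSE a.anchor_ts END,
               CASE WHEN julianday(n.ts) - julianday(a.anchor_ts) > 1
                    THEN 1 ELSE 0 END
        FROM numbered n
        JOIN anchored a
          ON n.user_id = a.user_id AND n.rn = a.rn + 1
        WHERE n.rn < 100  -- guard against runaway recursion
    )
    SELECT user_id, ts, keep FROM anchored ORDER BY user_id, ts
""").fetchall()

for user_id, ts, keep in rows:
    print(user_id, ts, bool(keep))
```

Compared with the LAG() version on the same data, the final event is now kept, because the comparison is anchored to the last kept event rather than the immediately preceding row.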

Amazon Redshift recommends adding an exit criterion to prevent very deep recursion and potential errors. In our case it can be a WHERE clause requiring the row number to be less than 100.

Thanks to Tubi for providing challenging tasks as part of my work, so I can come up with more interesting solutions. You can find more big data problems and examples on Tubi Engineering.