Funnel analysis in SQL using window functions, range frames, and regular expressions

A wizard reading a book called "Funnel Analysis SQL Spells" brewing a potion.
The wrong incantation might turn you into a frog or, worse, give you incorrect funnel conversion numbers. Credit: katiebcartoons.com
đź‘‹
Baxter is now a product development and contracting company run by the same two folks behind this idea, Vikram and Gabriel.

We're no longer building an event analytics product, but we're leaving these posts up for posterity.

The funnel analysis SQL generator is still available here.

A few weeks ago we released Baxter's free funnel analysis SQL generator supporting nine different SQL dialects. You can read about how the tool works and the assumptions we make about your data in this introductory blog post.

We encourage you to copy/paste the queries our tool generates for your use cases! Funnel analysis queries are tough to write and the ones we generate are well-tested and will save you time.

But we're also proponents for understanding how the code you're using actually works. In that spirit, we're going to share the ideas behind the query strategies we employ for users and SQL enthusiasts alike.

This is is the first post in a series. In it, I'm going to talk about how to write funnel analysis queries using three modern, common SQL features:

  • Window functions
  • Range frames (instead of row frames)
  • Regular expressions

These queries can analyze loosely ordered funnels with a total sequence time constraint. These terms are defined here.

We use this strategy to generate funnel analysis queries for BigQuery, Spark, ClickHouse, DuckDB, PostgreSQL, and MySQL.

It's much easier to follow the SQL we generate armed with the main ideas underpinning this query strategy, so let's talk about them.

Idea #1: You can use regular expressions to match funnels if you collapse sequences of events into strings.

Here’s how you might collapse a sequence of events into a string using DuckDB’s SQL dialect:

select
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) as event_sequence
from events
where ...
group by user_id

-- user_id | event_sequence
-- ------------------------
--       1   'ABAAC'
--       2   'AAAAAAB'
-- ...

If our funnel steps are “View Product” → “Add to Cart” → “Place Order”, the query above collapses each event into single-character identifiers and turns our funnel steps into “A” → “B” → “C”.

With strings of event sequences, using regular expressions to match how far users have progressed through a funnel is trivial and efficient.

The following query assumes we have a table called event_sequences with columns user_id and event_sequence, generated by our query above.

select 
    -- If there's an A, a user progressed to step 1.
    (select count(*) 
     from event_sequences 
     where regexp_match(event_sequence, 'A') as step_1_count,

    -- If B eventually follows A, a user progressed to step 2.
    (select count(*)
     from event_sequences
     where regexp_match(event_sequence, 'A.*B') as step_2_count,

    -- ... and if C eventually follows B, a user progressed to step 3.
    (select count(*)
     from event_sequences
     where regexp_match(event_sequence, 'A.*B.*C') as step_3_count,    

from event_sequences

-- step_1_count | step_2_count | step_3_count
-- ------------------------------------------
--          142             93             20

Note that our regular expressions permit any other events to occur between each funnel steps. This satisfies our loose ordering constraint.

This technique lets us use regular expressions to determine funnel progress efficiently. But how do we go about ensuring we only attempt to match funnels for event sequences that qualify in our analysis?

Recall that we need to consider two important constraints:

  1. All the events must occur within a specified period: a start and end time.
  2. All the events must occur within a specified duration. For example, all events A, B, and C might need to occur in a 24 hour period in our analysis. (This is our total sequence time constraint.)

#1 is easy — you can filter for events that fall within the start and end time in your where clause.

#2 is trickier and where window functions with range frames come into play.

Idea #2: Assemble a sliding window of events that occur within your funnel's duration by using window functions and range frames.

👉
Familiarity with window functions will help you follow this section of the post. If you need a quick primer I’m partial to DuckDB’s excellent documentation on this powerful SQL feature here.

Let’s assume these are our funnel analysis parameters:

  1. The events must occur in the month of January 2023.
  2. Our funnel steps are: View Product, Add to Cart, and Place Order.
  3. The funnel duration is 24 hours.

Let’s start with the following query, which helps us satisfy parameters #1 and #2 and collapses event sequences into strings. We’ll build on it next to satisfy parameter #3.

This query is written in DuckDB’s SQL dialect.

select 
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) as event_sequence       
from events
where event_timestamp > '2023-01-01' and
      event_timestamp < '2023-02-01' and
      (event_name = 'View Product' or
       event_name = 'Add to Cart' or
       event_name = 'Place Order');
group by user_id;

-- user_id | event_sequence
-- ------------------------
--       1   'ABAAC'
--       2   'AAAAAAB'
-- ...

We’re not doing anything fancy yet. This query assembles event sequences that:

  • … match our funnel steps (View Product, Add to Cart, and Place Order)
  • … and occur within the period that we’re analyzing (Jan 2023)

If we used regular expressions to count users who progressed through our funnel using the query results above, our counts would be incorrect. We have strings of event sequences, but there is no guarantee that these event sequences occurred within a 24-hour period.

The following query (using DuckDB’s SQL dialect) builds on the query above using window functions and range frames to make that guarantee.

select 
    user_id,
    string_agg(
        case
            when event_name = 'View Product'
            then 'A'
            when event_name = 'Add to Cart'
            then 'B'
            when event_name = 'Place Order'
            then 'C'
        end
    ) over funnel_duration as preceding_event_sequence,
    case
        when event_name = 'View Product'
        then 'A'
        when event_name = 'Add to Cart'
        then 'B'
        when event_name = 'Place Order'
        then 'C'
    end as event_identifier
from events
where event_timestamp > '2023-01-01' and
      event_timestamp < '2023-02-01' and
      (event_name = 'View Product' or
       event_name = 'Add to Cart' or
       event_name = 'Place Order')
window funnel_duration as (
    partition by user_id
    order by event_timestamp
    range between interval 24 hours preceding and 
                  current row
)

-- user_id | preceding_event_sequence | event_identifier
-- -----------------------------------------------------
--       1   ''                         'A'
--       1   'A'                        'B'
--       1   'AB'                       'A'
--       1   'ABA'                      'A'
--       1   'AA'                       'C'

Unlike our first query, which had a row for every user, there exists a row for every event in this result set. And preceding_event_sequence contains all the events that occurred within the 24 hours preceding each event!

Our first query showed that user 1’s sequence of events over the course of January 2023 was ABAAC. Examine the event_identifier column in the results above and you’ll find you can reconstruct user 1’s ABAAC event sequence.

But from these results you’ll also find:

  • User 1 progressed through A and B within a 24 hour period.
  • User 1 did not progress through A, B, and C within a 24 hour period.

The range frame in our window specification ensures that preceding_event_sequence maintains a sliding window of events within the 24 hours preceding each event:

...
window funnel_duration as (
    partition by user_id
    order by event_timestamp

    -- This is the range frame.
    range between interval 24 hours preceding and 
                  current row
)
...

If you've used window functions before, you're likely familiar with row frames. Row frames allow users to assemble a sliding window of some number of rows relative to the current row. Range frames are different: they allow users to assemble sliding windows of rows that satisfy some range of values relative to the current row.

In our case, those rows must have event_timestamp values within the 24 hours preceding our current row’s event_timestamp value.

Window functions are already useful, but range frames make them wildly powerful. They might be my favorite modern SQL feature to-date.

Assemble the rest of the query with your SQL bag-of-tricks: CTEs, aggregate functions, etc.

The ideas above decompose our funnel analysis problem into two steps that get us most of the way:

  1. Collapse event sequences into strings to match funnels.
  2. Only collect event sequences that occur within a 24 hour window.

There are a lot of ways you might assemble your final result and they rely on the techniques you use writing everyday SQL. I won't be going into the details here, but you can poke around our funnel SQL generator to see how we do it.

But make no mistake: the SQL will be verbose, impossible to parameterize without using a template language like Jinja (used by DBT), and it won’t make for light bedtime reading.

A note on our MySQL implementation

Our MySQL implementation works with MySQL 8. But it's a bit of hack that uses undocumented behavior. This is not the case for any of our other implementations!

Let me explain.

Unlike other dialects, there is no string aggregation function that works in a windowing context in MySQL. group_concat aggregates values into a string, but it does not work as a window function so we can't use it to implement this query strategy.

MySQL doesn't have a native list or array data type, either! With ClickHouse and Spark, we're able to a) aggregate event sequences into a list, and then b) concatenate all elements in that array. So we can generate ['A', 'B', 'C'] and then turn that into 'ABC'. We can't do that in MySQL.

MySQL does have a JSON type. And we can aggregate event sequences into a JSON array. These look like ['A', 'B', 'C'], but there's no equivalent function for us to turn that into 'ABC'.

Under the hood, however, MySQL's regular expression matching actually works on JSON arrays. It does so in a peculiar way – by matching against the string ["A", "B", "C"]

Here's how that works:

-- 'a' eventually followed by 'b' matches

select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b');
+------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a.*b') |
+------------------------------------------------------+
|                                                    1 |
+------------------------------------------------------+
1 row in set (0.00 sec)



-- 'b' follows 'a' in this array. But 'ab' does not match!

mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab');
+----------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'ab') |
+----------------------------------------------------+
|                                                  0 |
+----------------------------------------------------+
1 row in set (0.00 sec)



-- The string we're matching is actually:
-- ["a", "b", "c", "c", "d"]
--
-- So let's try matching 'a", "b` instead.

mysql> select REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b');
+--------------------------------------------------------+
| REGEXP_LIKE(JSON_ARRAY('a','b','c','c','d'), 'a", "b') |
+--------------------------------------------------------+
|                                                      1 |
+--------------------------------------------------------+
1 row in set (0.00 sec)


I know this is hack! But:

  • We suspect there is a small minority of users who are using MySQL for this task.
  • Our MySQL queries pass all our tests.
  • This strategy is significantly faster than our fallback strategy (self-joins).

If you run into issues using our MySQL implementation, or you'd like us to fall back to another strategy for it, please reach out to us.

Why can't Redshift, Snowflake, or SQLite use this query strategy?

Different reasons!

  • Redshift doesn’t support range frames in window functions. Its dialect is ancient and stuck somewhere in Postgres 8, the database it’s forked from.
  • Snowflake has a rich, modern SQL dialect. But it doesn’t doesn’t support sliding range frames, which the window function/range frame strategy that we’re using requires.
  • SQLite actually supports this strategy! But it won't work out of the box. While SQLite has dedicated regular expression-matching syntax (where <val> REGEXP <regexp>), it actually just calls a function that users must supply. Most implementations do not provide one, so we use our fallback strategy instead.

Some observations on the above:

  • It is frustrating how much Redshift's SQL dialect lacks in features.
  • Snowflake has this one odd omission in its support for window ranges – but it does implement MATCH_RECOGNIZE,  a powerful and expressive operator that's perfect for funnel analysis.
  • I am confused as to why SQLite  dedicates language syntax to regular expression matching without giving users the facility to use it out of the box.

Next: stay tuned to learn how to use Snowflake's MATCH_RECOGNIZE to implement funnel analysis queries

Snowflake's dialect is powerful. You can see MATCH_RECOGNIZE in action in our funnel SQL query generator. I'm excited to share how that works.

Join our waitlist or follow us on Twitter to stay up to date on our blog posts, free tools, and product launches.