Skip to content
Snippets Groups Projects

Fix broken growth metrics query

Created by: dadlerj

The current query doesn't work as expected:

The first query (latest_usage_by_month) returns one row for every (user x month) pair in the event_logs table.

The second query (sub) joins that table with users. This sort of join goes line-by-line and results in a multiplicative effect, where a row is generated for each match between the two tables. So in this case, since the users table has one row per user, and the latest_usage_by_month table has one row per user per month, the resulting table also has one row per user per month.

This creates a problem on the final query, because there are multiple rows per user in the sub query, none of which actually represent reality (because the join is going line-by-line, it never sees that a user was active in multiple months at once)!

So, on my local instance, where there's only one user, the first query alone (latest_usage_by_month) returns the following:

 user_id |      month_active      
---------+------------------------
       1 | 2020-08-01 00:00:00-07
       1 | 2020-07-01 00:00:00-07
       1 | 2020-06-01 00:00:00-07
       1 | 2020-09-01 00:00:00-07

And then, the second query returns:

id | current_month | previous_month |     created_month      | deleted_month 
----+---------------+----------------+------------------------+---------------
  1 | f             | f              | 2020-04-01 00:00:00-07 | 
  1 | f             | t              | 2020-04-01 00:00:00-07 | 
  1 | t             | f              | 2020-04-01 00:00:00-07 | 

(the only reason there aren't four rows here is because your use of DISTINCT users.id, which apparently causes aggregation even without a GROUP BY, which is very weird and unexpected :thinking: . I removed it here because DISTINCT without a GROUP BY shouldn't really mean anything...)

And when that gets combined into the final table, it has bizarre results:

 created_users | deleted_users | resurrected_users | churned_users | retained_users 
---------------+---------------+-------------------+---------------+----------------
             0 |             0 |                 1 |             1 |              0

So this change does two major things:

  1. It groups the second query, using BOOL_OR() operators to join the current_month and previous_month values together for a single user. This is essential—it combines each user from multiple rows into a single row.
  2. It removes the DISTINCT users.id which shouldn't have been doing anything anyway, and caused some bizarre behavior that I can't really explain.

This results in the following for the second query (sub):

 id | current_month | previous_month |     created_month      | deleted_month 
----+---------------+----------------+------------------------+---------------
  1 | t             | t              | 2020-04-01 00:00:00-07 | 

and the following for the final query:

 created_users | deleted_users | resurrected_users | churned_users | retained_users 
---------------+---------------+-------------------+---------------+----------------
             0 |             0 |                 0 |             0 |              1

Merge request reports

Approval is optional

Merged by avatar (Nov 7, 2025 4:57am UTC)

Merge details

  • Changes merged into main with a979ecdf.
  • Deleted the source branch.

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Created by: ebrodymoore

    @dadlerj one question here: why didn't the original query return this (below)? I understand why the BOOL_OR is a better route but It seemed like the CASE WHEN had either the T/F option and it shouldn't have been able to create three or four separate rows per user_id and month. Usually I see that behavior with joins but never with a case.

     id | current_month | previous_month |     created_month      | deleted_month 
    ----+---------------+----------------+------------------------+---------------
      1 | t             | t              | 2020-04-01 00:00:00-07 | 
  • Created by: dadlerj

    Good question. The issue was that there was nothing that ever grouped results together. Unless you use a GROUP BY, SQL SELECT queries essentially just loop through every row in a table that matches and output a new row based on what you selected...

    So the CASE statement will be run for every row in the input table, and then the result would be added to every row, but it wouldn't combine those rows.

    And because the very first subquery you had grouped on both 1 and 2, you ended up with multiple rows per user at every step in the flow from there:

    WITH
      latest_usage_by_user AS (
      SELECT
        user_id,
        DATE_TRUNC('month', timestamp) AS month_active
      FROM
        event_logs
      GROUP BY
        1,
        2 ),

    Let me know if that's clear. It can be helpful to go subquery-by-subquery to test results along the way.

Please register or sign in to reply
Loading