Fix broken growth metrics query
Created by: dadlerj
The current query doesn't work as expected:
The first query (latest_usage_by_month
) returns one row for every (user x month) pair in the event_logs
table.
The second query (sub
) joins that table with users
. This sort of join goes line-by-line and results in a multiplicative effect, where a row is generated for each match between the two tables. So in this case, since the users
table has one row per user, and the latest_usage_by_month
table has one row per user per month, the resulting table also has one row per user per month.
This creates a problem on the final query, because there are multiple rows per user in the sub
query, none of which actually represent reality (because the join is going line-by-line, it never sees that a user was active in multiple months at once)!
So, on my local instance, where there's only one user, the first query alone (latest_usage_by_month
) returns the following:
user_id | month_active
---------+------------------------
1 | 2020-08-01 00:00:00-07
1 | 2020-07-01 00:00:00-07
1 | 2020-06-01 00:00:00-07
1 | 2020-09-01 00:00:00-07
And then, the second query returns:
id | current_month | previous_month | created_month | deleted_month
----+---------------+----------------+------------------------+---------------
1 | f | f | 2020-04-01 00:00:00-07 |
1 | f | t | 2020-04-01 00:00:00-07 |
1 | t | f | 2020-04-01 00:00:00-07 |
(the only reason there aren't four rows here is because your use of DISTINCT users.id
, which apparently causes aggregation even without a GROUP BY
, which is very weird and unexpected DISTINCT
without a GROUP BY
shouldn't really mean anything...)
And when that gets combined into the final table, it has bizarre results:
created_users | deleted_users | resurrected_users | churned_users | retained_users
---------------+---------------+-------------------+---------------+----------------
0 | 0 | 1 | 1 | 0
So this change does two major things:
- It groups the second query, using
BOOL_OR()
operators to join thecurrent_month
andprevious_month
values together for a single user. This is essential—it combines each user from multiple rows into a single row. - It removes the
DISTINCT users.id
which shouldn't have been doing anything anyway, and caused some bizarre behavior that I can't really explain.
This results in the following for the second query (sub
):
id | current_month | previous_month | created_month | deleted_month
----+---------------+----------------+------------------------+---------------
1 | t | t | 2020-04-01 00:00:00-07 |
and the following for the final query:
created_users | deleted_users | resurrected_users | churned_users | retained_users
---------------+---------------+-------------------+---------------+----------------
0 | 0 | 0 | 0 | 1