Fix broken growth metrics query
Created by: dadlerj
The current query doesn't work as expected:
The first query (latest_usage_by_month) returns one row for every (user x month) pair in the event_logs table.
The second query (sub) joins that table with users. This sort of join goes line-by-line and results in a multiplicative effect, where a row is generated for each match between the two tables. So in this case, since the users table has one row per user, and the latest_usage_by_month table has one row per user per month, the resulting table also has one row per user per month.
This creates a problem on the final query, because there are multiple rows per user in the sub query, none of which actually represent reality (because the join is going line-by-line, it never sees that a user was active in multiple months at once)!
So, on my local instance, where there's only one user, the first query alone (latest_usage_by_month) returns the following:
user_id | month_active
---------+------------------------
1 | 2020-08-01 00:00:00-07
1 | 2020-07-01 00:00:00-07
1 | 2020-06-01 00:00:00-07
1 | 2020-09-01 00:00:00-07
And then, the second query returns:
id | current_month | previous_month | created_month | deleted_month
----+---------------+----------------+------------------------+---------------
1 | f | f | 2020-04-01 00:00:00-07 |
1 | f | t | 2020-04-01 00:00:00-07 |
1 | t | f | 2020-04-01 00:00:00-07 |
(the only reason there aren't four rows here is because your use of DISTINCT users.id, which apparently causes aggregation even without a GROUP BY, which is very weird and unexpected 
DISTINCT without a GROUP BY shouldn't really mean anything...)
And when that gets combined into the final table, it has bizarre results:
created_users | deleted_users | resurrected_users | churned_users | retained_users
---------------+---------------+-------------------+---------------+----------------
0 | 0 | 1 | 1 | 0
So this change does two major things:
- It groups the second query, using
BOOL_OR()operators to join thecurrent_monthandprevious_monthvalues together for a single user. This is essential—it combines each user from multiple rows into a single row. - It removes the
DISTINCT users.idwhich shouldn't have been doing anything anyway, and caused some bizarre behavior that I can't really explain.
This results in the following for the second query (sub):
id | current_month | previous_month | created_month | deleted_month
----+---------------+----------------+------------------------+---------------
1 | t | t | 2020-04-01 00:00:00-07 |
and the following for the final query:
created_users | deleted_users | resurrected_users | churned_users | retained_users
---------------+---------------+-------------------+---------------+----------------
0 | 0 | 0 | 0 | 1
Merge request reports
Activity
Created by: ebrodymoore
@dadlerj one question here: why didn't the original query return this (below)? I understand why the BOOL_OR is a better route but It seemed like the
CASE WHENhad either the T/F option and it shouldn't have been able to create three or four separate rows per user_id and month. Usually I see that behavior with joins but never with a case.id | current_month | previous_month | created_month | deleted_month ----+---------------+----------------+------------------------+--------------- 1 | t | t | 2020-04-01 00:00:00-07 |Created by: dadlerj
Good question. The issue was that there was nothing that ever grouped results together. Unless you use a
GROUP BY, SQLSELECTqueries essentially just loop through every row in a table that matches and output a new row based on what you selected...So the
CASEstatement will be run for every row in the input table, and then the result would be added to every row, but it wouldn't combine those rows.And because the very first subquery you had grouped on both 1 and 2, you ended up with multiple rows per user at every step in the flow from there:
WITH latest_usage_by_user AS ( SELECT user_id, DATE_TRUNC('month', timestamp) AS month_active FROM event_logs GROUP BY 1, 2 ),Let me know if that's clear. It can be helpful to go subquery-by-subquery to test results along the way.