Fix broken growth metrics query (!13700) · Merge requests · Administrator / sourcegraph

Administrator requested to merge fix-growth-pings into main Sep 08, 2020

Created by: dadlerj

The current query doesn't work as expected:

The first query (latest_usage_by_month) returns one row for every (user x month) pair in the event_logs table.

The second query (sub) joins that table with users. This sort of join goes line-by-line and results in a multiplicative effect, where a row is generated for each match between the two tables. So in this case, since the users table has one row per user, and the latest_usage_by_month table has one row per user per month, the resulting table also has one row per user per month.

This creates a problem on the final query, because there are multiple rows per user in the sub query, none of which actually represent reality (because the join is going line-by-line, it never sees that a user was active in multiple months at once)!

So, on my local instance, where there's only one user, the first query alone (latest_usage_by_month) returns the following:

 user_id |      month_active      
---------+------------------------
       1 | 2020-08-01 00:00:00-07
       1 | 2020-07-01 00:00:00-07
       1 | 2020-06-01 00:00:00-07
       1 | 2020-09-01 00:00:00-07

And then, the second query returns:

id | current_month | previous_month |     created_month      | deleted_month 
----+---------------+----------------+------------------------+---------------
  1 | f             | f              | 2020-04-01 00:00:00-07 | 
  1 | f             | t              | 2020-04-01 00:00:00-07 | 
  1 | t             | f              | 2020-04-01 00:00:00-07 |

(the only reason there aren't four rows here is because your use of DISTINCT users.id, which apparently causes aggregation even without a GROUP BY, which is very weird and unexpected 🤔 . I removed it here because DISTINCT without a GROUP BY shouldn't really mean anything...)

And when that gets combined into the final table, it has bizarre results:

 created_users | deleted_users | resurrected_users | churned_users | retained_users 
---------------+---------------+-------------------+---------------+----------------
             0 |             0 |                 1 |             1 |              0

So this change does two major things:

It groups the second query, using BOOL_OR() operators to join the current_month and previous_month values together for a single user. This is essential—it combines each user from multiple rows into a single row.
It removes the DISTINCT users.id which shouldn't have been doing anything anyway, and caused some bizarre behavior that I can't really explain.

This results in the following for the second query (sub):

 id | current_month | previous_month |     created_month      | deleted_month 
----+---------------+----------------+------------------------+---------------
  1 | t             | t              | 2020-04-01 00:00:00-07 |

and the following for the final query:

 created_users | deleted_users | resurrected_users | churned_users | retained_users 
---------------+---------------+-------------------+---------------+----------------
             0 |             0 |                 0 |             0 |              1

Fix broken growth metrics query

Merge request reports