Ping investigation: Instances are offline
Created by: unknwon
Context
- Observation: We started seeing more (large) instances became offline.
- Affected versions: Since 3.14.
- Result: We can't receive pings and customer instances won't be able to receive update notifications.
Conclusions
- Our queries are really slow. Based on pghero, it take 44s on average on dot-com:
- We run too many queries for each ping. With a naive instrumentation, the method
eventLogs.countPerPeriodBySQL
being call 222 times!
+var count = 0
func (l *eventLogs) countPerPeriodBySQL(ctx context.Context, countExpr, interval, period *sqlf.Query, startDate, endDate time.Time, conds []*sqlf.Query) ([]UsageValue, error) {
+ count++
+ fmt.Println("count:", count)
It is not a problem we would be able to notice immediately as it takes time for an instance's event_logs
to grow until queries become too slow to finish before timeout. As of now, dot-com has 13M rows in events_log
.
The major offender is we do daily, weekly and monthly for each type of events we care, e.g. code intel would at least call eventLogs.countPerPeriodBySQL
27 times:
Same phenomena for search stats.
Suggestions
There are three things we could probably do to tackle the problem:
- Make as many queries concurrent as possible.
- Reduce the number of queries by doing some of the calculations without SQL.
- Optimize the query itself (note: almost every column used in
event_logs
table has an index).
Example EXPLAIN ANALYZE
sg=# EXPLAIN ANALYZE WITH all_periods AS
(SELECT generate_series(
('2020-04-01')::timestamp,
('2020-05-01')::timestamp,
('1 month')::interval) AS period),
count_by_period AS
(SELECT DATE_TRUNC('month', timestamp) AS period,
COUNT(DISTINCT CASE WHEN user_id = 0 THEN anonymous_user_id ELSE CAST(user_id AS TEXT) END) AS count
FROM event_logs
WHERE (name='SearchResultsQueried')
GROUP BY period)
SELECT all_periods.period, COALESCE(count, 0)
FROM all_periods
LEFT OUTER JOIN count_by_period ON all_periods.period = (count_by_period.period)::timestamp
ORDER BY period DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Left Join (cost=575347.41..594213.33 rows=1047690 width=16) (actual time=9035.404..9035.408 rows=2 loops=1)
Merge Cond: (all_periods.period = ((count_by_period.period)::timestamp without time zone))
CTE all_periods
-> ProjectSet (cost=0.00..5.02 rows=1000 width=8) (actual time=0.003..0.004 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
CTE count_by_period
-> GroupAggregate (cost=543215.64..548977.94 rows=209538 width=16) (actual time=8633.789..9035.335 rows=4 loops=1)
Group Key: (date_trunc('month'::text, event_logs."timestamp"))
-> Sort (cost=543215.64..543739.49 rows=209538 width=45) (actual time=8558.695..8610.906 rows=205325 loops=1)
Sort Key: (date_trunc('month'::text, event_logs."timestamp"))
Sort Method: external merge Disk: 11888kB
-> Gather (cost=7744.48..518248.78 rows=209538 width=45) (actual time=122.223..8418.548 rows=205325 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on event_logs (cost=6744.48..496294.98 rows=87308 width=45) (actual time=115.814..8424.974 rows=68442 loops=3)
Recheck Cond: (name = 'SearchResultsQueried'::text)
Rows Removed by Index Recheck: 766317
Heap Blocks: exact=19162 lossy=32478
-> Bitmap Index Scan on event_logs_name (cost=0.00..6692.10 rows=209538 width=0) (actual time=103.060..103.060 rows=206060 loops=1)
Index Cond: (name = 'SearchResultsQueried'::text)
-> Sort (cost=69.83..72.33 rows=1000 width=8) (actual time=0.009..0.009 rows=2 loops=1)
Sort Key: all_periods.period DESC
Sort Method: quicksort Memory: 25kB
-> CTE Scan on all_periods (cost=0.00..20.00 rows=1000 width=8) (actual time=0.004..0.006 rows=2 loops=1)
-> Materialize (cost=26294.62..27342.31 rows=209538 width=16) (actual time=9035.390..9035.391 rows=3 loops=1)
-> Sort (cost=26294.62..26818.47 rows=209538 width=16) (actual time=9035.387..9035.387 rows=3 loops=1)
Sort Key: ((count_by_period.period)::timestamp without time zone) DESC
Sort Method: quicksort Memory: 25kB
-> CTE Scan on count_by_period (cost=0.00..4190.76 rows=209538 width=16) (actual time=8633.794..9035.357 rows=4 loops=1)
Planning Time: 0.193 ms
Execution Time: 9043.278 ms