Use Postgres intarray to store permissions `object_ids`
Created by: unknwon
Context
@mrnugget on Slack
I have to say, I’m really not happy with the application-level joins we now have to do due to authz. Every time we want to do something, we have to find the repo ID for a patch/changeset/changesetjob (which sometimes requires another DB query) and then we have to send the repo ID through the db.Repos.Get layer, which uses the authzFilter. Only then can we do what we previously did. A lot of things we could do with a single database query (“update … where campaign_id = foo”) now require 1+ queries to the database (and potentially to the code hosts!).
To which @tsenart replied
It might be interesting to look into https://www.postgresql.org/docs/9.6/intarray.html to store the object_ids, and convert that to a roaring bitmap on the application side, rather than storing the byte form of the roaring bitmap. It even supports set operations. Wondering about storage impact, but performance should be good by reading the docs, given the right index.
Success criteria
- Campaigns permissions queries can be done in pure SQL without application-level joins.
- `authzFilter` can be replaced with pure SQL joins.
- Storage needs don't grow disproportionately on large instances (say, 1M repos, 10k users).
- Performance impact of pure SQL permissions joins is sub-25 ms.
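As a sketch of what such a pure-SQL permissions query could look like, assuming `user_permissions.object_ids` is stored as `integer[]` (table and column names other than `user_permissions.object_ids` are assumptions for illustration):

```sql
-- Hypothetical sketch: list the repos a user may see in a single query,
-- with no application-level join through authzFilter. Assumes
-- user_permissions.object_ids is an integer[] of permitted repo IDs;
-- the repo table and its columns are assumed names.
SELECT r.id, r.name
FROM repo r
JOIN user_permissions p ON p.user_id = $1
WHERE r.id = ANY (p.object_ids);
```

The `intarray` extension additionally provides set operators on `integer[]` (e.g. `&` for intersection, `|` for union, `@>` for containment), so permission checks like "is repo X in the user's set" can also be written as `p.object_ids @> ARRAY[x]`.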
Technical notes
Migrations
We need to write a SQL migration to change the data type of `object_ids` in `user_permissions` and `user_ids` in `repo_permissions` to `integer[]` (the plain array type that the `intarray` extension operates on), and to enable the extension with `CREATE EXTENSION intarray`. This assumes we simply drop existing permissions instead of migrating them, which would require a complex application-level migration. Depending on when we do this (i.e. if many customers have adopted permissions by then), we might need the application-level migration after all.
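A minimal sketch of that migration, under the drop-existing-data assumption (the current column type is assumed to be the `bytea` form of the roaring bitmap, so the `USING` clause resets every row rather than converting it):

```sql
-- Hypothetical migration sketch: enable intarray and switch the
-- permission ID columns to integer[], discarding existing permission
-- data (each row is reset to the empty array).
CREATE EXTENSION IF NOT EXISTS intarray;

ALTER TABLE user_permissions
  ALTER COLUMN object_ids TYPE integer[] USING '{}'::integer[];

ALTER TABLE repo_permissions
  ALTER COLUMN user_ids TYPE integer[] USING '{}'::integer[];
```

Permissions would then be re-synced from the code hosts after the migration runs, which is what makes dropping the old data acceptable.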
Indexing
From the docs:
Two GiST index operator classes are provided: gist__int_ops (used by default) is suitable for small- to medium-size data sets, while gist__intbig_ops uses a larger signature and is more suitable for indexing large data sets (i.e., columns containing a large number of distinct array values). The implementation uses an RD-tree data structure with built-in lossy compression.
Since we expect `object_ids` to contain up to as many repo IDs as there are repos in an instance (e.g. 400k), we should ensure we use the `gist__intbig_ops` index operator class.
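A sketch of the corresponding index definitions (index names are assumptions):

```sql
-- Hypothetical sketch: GiST indexes using the large-signature operator
-- class from intarray, suited to columns holding a large number of
-- distinct array values (e.g. hundreds of thousands of repo IDs).
CREATE INDEX user_permissions_object_ids_idx
  ON user_permissions USING gist (object_ids gist__intbig_ops);

CREATE INDEX repo_permissions_user_ids_idx
  ON repo_permissions USING gist (user_ids gist__intbig_ops);
```

Note that these indexes accelerate the intarray operators (`&&`, `@>`, `<@`), so queries should be phrased in terms of those operators to benefit from them.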