API docs: codeintel: begin indexing API docs for search

Warren Gifford requested to merge sg/apidocs-search-indexing-take-3 into main

Created by: slimsag

Third time's the charm :)

  • I first sent this change in #25206. It turned out to have a fatal flaw: it read from a table in the wrong DB, an easy mistake to make because we develop and test against a single DB. Tests wouldn't have helped here, but @efritz and I have discussed some options for preventing this in the future, including maybe running separate databases (in the same Postgres instance) in dev, CI, or both.
  • I then fixed the issue, which was a small change, in #25666, but completely fumbled communication with my reviewer and prematurely merged the PR without approval. (I've since chatted with both him and my manager, and will make sure this doesn't happen again. Sorry again!)
  • Now I'm re-sending it. Third time's the charm (maybe?) :)

Background

  • In #25197 I landed architecture docs detailing how we arrived at using Postgres FTS for API docs search integration, the tradeoffs/implications of doing so, the general implementation plan for it, etc.
  • In #25199 I landed the initial DB schema for adding a Postgres FTS index over API docs data (i.e. all symbols we have stored/indexed in API docs, currently just a few thousand Go repos).

This PR

This PR updates the codeintel worker and lsifstore to begin actually writing API docs data to the new lsif_data_documentation_search_public and lsif_data_documentation_search_private tables when new LSIF bundles are uploaded. Indexing is on by default, but can be disabled via the new site config feature flag "apidocs.search.indexing": "disabled".
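As a rough illustration of the feature flag's on-by-default behavior, here is a minimal sketch in Go. The siteConfig type and indexingEnabled helper are hypothetical names made up for this example, not the code in this PR, and the sketch assumes the site config is plain JSON and that any value other than "disabled" (including an unset key) means indexing is enabled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// siteConfig is a hypothetical, minimal view of the site configuration.
// Only the key named in this PR description is modeled.
type siteConfig struct {
	APIDocsSearchIndexing string `json:"apidocs.search.indexing"`
}

// indexingEnabled reports whether API docs search indexing should run.
// Assumption: any value other than "disabled" (including an unset key)
// is treated as enabled, matching "on by default".
func indexingEnabled(raw []byte) (bool, error) {
	var cfg siteConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	return cfg.APIDocsSearchIndexing != "disabled", nil
}

func main() {
	configs := []string{
		`{}`,
		`{"apidocs.search.indexing": "disabled"}`,
	}
	for _, raw := range configs {
		enabled, err := indexingEnabled([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> enabled=%v\n", raw, enabled)
	}
}
```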

These tables are for API docs search indexing only. To prevent scaling issues and avoid breaking the DB, the default configuration "apidocs.search.index-size-limit-factor": 1.0 limits the size of each table independently to 250 million symbols (rows), i.e. 500 million across both tables, or approx. 12.5k Go repos total. These numbers were arrived at through some estimation documented in the architecture design doc.
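To make the arithmetic behind these limits concrete, here is a small Go sketch. It assumes the factor scales the 250 million per-table limit linearly, which this description implies but does not state outright; rowLimit and baseRowsPerTable are hypothetical names used only for illustration.

```go
package main

import "fmt"

// baseRowsPerTable is the per-table limit stated above for a factor of 1.0.
const baseRowsPerTable = 250_000_000

// rowLimit returns the per-table row limit for a given
// "apidocs.search.index-size-limit-factor" value.
// Assumption: the factor scales the base limit linearly.
func rowLimit(factor float64) int64 {
	return int64(float64(baseRowsPerTable) * factor)
}

func main() {
	perTable := rowLimit(1.0)
	total := 2 * perTable // public + private tables

	// 500 million symbols spread over roughly 12.5k Go repos works out
	// to an average of 40,000 symbols per repo.
	fmt.Println(perTable, total, total/12_500)
}
```

Under that linear-scaling assumption, a factor of 2.0 would allow 500 million rows per table.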

Once queries against the table have solidified more, things have settled, and I've optimized the table further (to eliminate data that is needlessly repeated across rows), we will begin raising the search index size limit beyond 500 million symbols / 12.5k Go repos.

Helps #25193

Testing

You'll notice there are no tests in this PR: API docs is still at a stage where we're verifying whether this is even something people want and whether we should continue investing in it. If a test is likely to protect the stability of the rest of Sourcegraph, I want to spend the time to write it. If it's only likely to improve the stability of API docs itself, however, that's time I'd rather spend elsewhere for now. This is sometimes a razor-thin line, though, so please help me by calling out anywhere you think I've made the wrong choice!

Future

To keep this PR small, the next PRs I have staged are for:

  1. codeintel: Add an OOB migration which migrates existing data into the search index
  2. codeintel: Add the actual code for querying this index, add a new GraphQL API for exposing it, etc.
  3. search: Integrate this into search suggestions
