why our syntax highlighter is in bad shape / has a lot of "tech debt", and what we should do about it
Created by: slimsag
Ok here goes.. this is a bit ranty, but bear with me.
We haven't invested at all in `syntect_server`: it has likely received less than one full month of engineering effort in total, across everyone at Sourcegraph, despite being core to Sourcegraph's functioning and having existed for >4 years since I hacked together the initial version in two days back when Sourcegraph was an immature startup.
Despite this, it "is basically okay" for most users/customers today, aside from rare issues in very large Sourcegraph deployments like Sourcegraph.com, where it is too slow to respond and we fall back to plaintext highlighting.
The goal of this issue is not to say `syntect_server` is trash and we should throw it out (I think that would be a grave mistake, personally, for many reasons), but rather to explain why we're in bad shape today and what I hope someone else here at Sourcegraph will pick up to improve the situation.
- `syntect` (the library `syntect_server` wraps for syntax highlighting) supports a specific set of grammars out of the box. They only test with that specific set of grammars, and even among those there are some known issues.
- We extend that set of grammars by a lot, because over the years we've just chucked in more grammars as needed. For the most part it "just works", so it's been fine. We don't have any tests that confirm these grammars work, and rely 100% on the upstream project for testing. (A sketch of how these grammars get loaded and compiled follows this list.)
- Our version of syntect hasn't been updated in at least half a year, maybe more. It's possible updating it would help, but doing so requires careful monitoring because we really don't know what will break when we update.
- So, while everything here "mostly just works", there are sometimes two problems: (1) a specific language/file will load slowly, or (2) a specific file (not a whole language, but a specific file) will cause the entire syntect library to either:
  - panic (the server crashes), or
  - deadlock (a single request consumes 100% of its CPU, and the highlighting for that request never, ever returns).
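For reference, here is a minimal sketch of how a grammar set like ours can be loaded and compiled with syntect's `SyntaxSetBuilder`. This assumes syntect 4.x with default features; the `extra-syntaxes` folder name is illustrative, not necessarily how `syntect_server` actually wires this up today:

```rust
use syntect::parsing::{SyntaxSet, SyntaxSetBuilder};

fn load_grammars() -> SyntaxSet {
    let mut builder = SyntaxSetBuilder::new();
    // Fallback so files with no matching grammar still get a plaintext syntax.
    builder.add_plain_text_syntax();
    // Recursively load every .sublime-syntax grammar we've chucked in over the
    // years (`true` = lines include their trailing newline). None of these
    // extra grammars are covered by upstream syntect tests.
    builder
        .add_from_folder("extra-syntaxes", true)
        .expect("loading bundled grammars failed");
    // build() links ("compiles") the grammars; doing this once at startup and
    // keeping the SyntaxSet in memory is the caching described further below.
    builder.build()
}
```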
So - this means we have three problems that need solving (in a metaphorical sense):
- "Highlighting is sometimes slow"
  - To fix this, `syntect_server` caches all of the grammars in memory: not the highlighted files, but the process of loading a grammar and "compiling" it, so that highlighting itself is super quick. Still, some grammars are particularly slow on specific files or specific languages, either because the language grammar itself uses a ton of very bad regexp (lookaheads, etc.) and is just slow, or because syntect needs to exhaust a lot of cases in order to get to the result. The only fix for this is optimizing syntect for more languages, or giving it more CPU resources. We have thus far chosen the latter, and with 4-8 CPUs it works pretty well.
- "panic (the server crashes)"
  - This happens on some files, not all. It's a bug in syntect, caused by them not testing all the languages we use. Importantly, it happens because syntect has poor error handling: it uses `panic!` and `.unwrap()` in a few places, which is not idiomatic Rust. There is an issue open since 2017, and nobody has worked on it: https://github.com/trishume/syntect/issues/98
  - These panics happen rarely, but due to the frequency and number of languages we're highlighting on large deployments (Sourcegraph.com, large customer deployments, etc.), they happen frequently there.
  - To solve this, we worked around the issue by "catching" the panics using `panic::catch_unwind`, which is NOT idiomatic or a good idea in general - but hey, it does work: https://github.com/sourcegraph/syntect_server/blob/master/src/main.rs#L63-L66 (a sketch of this workaround follows this list.)
- "deadlock (a single request consumes 100% of its CPU, and the highlighting for that request never, ever, returns.)"
  - Importantly, other requests will continue working just fine! However, we've now lost one full CPU, so everything else gets that much slower.
  - To work around this (because the issue suddenly became critical for two customers at the same time), we wrapped `syntect_server` in a dumb Go reverse proxy (`http-server-stabilizer`) which:
    - Spins up some number of `WORKERS` - unfortunately, each one requires 1.1GB of memory to cache the compiled language grammars separately.
    - Load balances connections across the workers, so if one dies only e.g. 1/4th or 1/8th of in-flight connections actually die (in which case Sourcegraph falls back to plaintext mode).
    - Monitors every connection to each worker, and if a single one of them takes longer than 10s, kills that worker process - because it likely means the request has gotten stuck and would consume 100% of a CPU for the remainder of the process' lifetime.
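To make the `panic::catch_unwind` workaround above concrete, here is a minimal sketch of the idea - assuming syntect 4.x, and not the exact handler code in `syntect_server`'s main.rs (see the link above for that):

```rust
use std::panic;

use syntect::highlighting::ThemeSet;
use syntect::html::highlighted_html_for_string;
use syntect::parsing::SyntaxSet;

/// Highlight `code`, falling back to plaintext if syntect panics somewhere
/// deep inside a grammar.
fn highlight_or_plaintext(code: &str, extension: &str) -> String {
    let syntax_set = SyntaxSet::load_defaults_newlines();
    let theme_set = ThemeSet::load_defaults();
    let theme = &theme_set.themes["InspiredGitHub"];

    // AssertUnwindSafe: we promise not to observe broken invariants after a
    // panic; the result is only used to build an HTTP response.
    let result = panic::catch_unwind(panic::AssertUnwindSafe(|| {
        let syntax = syntax_set
            .find_syntax_by_extension(extension)
            .unwrap_or_else(|| syntax_set.find_syntax_plain_text());
        highlighted_html_for_string(code, &syntax_set, syntax, theme)
    }));

    match result {
        Ok(html) => html,
        // A panic inside syntect was caught: serve plaintext instead of
        // letting the whole process die. (Real code must HTML-escape here.)
        Err(_) => format!("<pre>{}</pre>", code),
    }
}
```

Note that this only helps with panics, which unwind; it does nothing for the deadlock case, since a stuck request never returns at all.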
Why can't I horizontally scale syntect_server?
The combination of issues we experience right now means it doesn't make sense to.
First, due to the legitimate case of "Highlighting is sometimes slow", you really do want most requests to have access to a lot of CPU. Highlighting files does not require equal resources; some files require far more than others. In an ideal world, you want multiple beefy `syntect_server` instances.
With horizontal scaling, we could choose to either continue using `panic::catch_unwind` or rely on Kubernetes container restarts. However, given the rate of panics on large deployments like Sourcegraph.com, simply removing `panic::catch_unwind` would effectively render the service dead entirely. How many panics are happening? I'm not sure; the monitoring for `worker_deaths` got removed from the Grafana dashboard, but I think the Prometheus metric may still be around.
Importantly, Kubernetes health checks to a `/healthz` endpoint on the container do NOT tell us if a request has gotten stuck at 100% CPU and will never return! That may be the case for multiple requests, and all CPUs may be consumed entirely, while the container still responds to `/healthz`. The only way we would know is if the `/healthz` endpoint could somehow monitor all currently pending requests on the server and check whether any of them has been stuck for some period of time (>10s, like that nasty `http-server-stabilizer` hack does). syntect does not support a timeout parameter / context like Go does, so fixing this is not trivial: we don't know where in the syntect codebase it is getting stuck or why, and syntect is a relatively complex stack machine.
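If we did want a smarter `/healthz`, the idea described above would look roughly like this - purely illustrative, this does not exist in `syntect_server` today, and the names and the 10s threshold are made up:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// A request is considered stuck after this long (mirroring the 10s cutoff
/// http-server-stabilizer uses today).
const STUCK_AFTER: Duration = Duration::from_secs(10);

#[derive(Default)]
struct InFlight {
    /// request id -> when highlighting for it started
    requests: Mutex<HashMap<u64, Instant>>,
}

impl InFlight {
    fn start(&self, id: u64) {
        self.requests.lock().unwrap().insert(id, Instant::now());
    }

    fn finish(&self, id: u64) {
        self.requests.lock().unwrap().remove(&id);
    }

    /// What a smarter /healthz handler would call: report unhealthy as soon
    /// as any in-flight request has been running longer than STUCK_AFTER.
    fn healthy(&self) -> bool {
        self.requests
            .lock()
            .unwrap()
            .values()
            .all(|started| started.elapsed() < STUCK_AFTER)
    }
}
```

Even then, "unhealthy" only gets Kubernetes to restart the whole container; it still can't cancel the stuck request itself.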
What should we do to fix this?
So far, nobody / no team has been interested in or had the time to even investigate this. I want to be clear: `syntect_server` is basically unmaintained because of a combination of two mentalities: "it works well enough" (generally true) and "stephen owns it" (no longer true).
In my view, the following would give us the best return on engineering effort in order of priority:
- Actually record which files are causing `syntect` to either get entirely stuck or to panic, i.e. https://github.com/sourcegraph/sourcegraph/issues/16085, and then debug/fix those upstream in `syntect`.
- Add the ability to specify a timeout in `syntect`, so that we can get rid of `http-server-stabilizer` entirely by using timeouts in the `syntect_server` Rust code itself: https://github.com/trishume/syntect/issues/202 (see the sketch after this list for why the timeout has to live inside syntect).
- Remove `syntect`'s usages of `panic!` and `.unwrap()` so we can get rid of the stack unwinding hack: https://github.com/trishume/syntect/issues/98
- Remove `http-server-stabilizer` entirely, and add support in Sourcegraph for horizontally scaling `syntect_server`.
- Add a test suite with a large corpus of files for each supported language, so that we can actually feel confident updating syntect instead of having no idea what will break when we do.
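On the timeout item: it's worth spelling out why the timeout has to live inside `syntect` itself rather than being bolted on around it. We can already "time out" a call from the outside, e.g. by running the highlighting on another thread and giving up after 10s, but the stuck thread keeps burning 100% of a CPU for the rest of the process' lifetime - which is exactly the problem `http-server-stabilizer` papers over today. A purely illustrative sketch (names are made up):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn highlight_with_deadline(code: String, extension: String) -> Option<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Imagine this is the real syntect highlighting call.
        let html = expensive_highlight(&code, &extension);
        // Ignore the error: the receiver may have given up already.
        let _ = tx.send(html);
    });
    match rx.recv_timeout(Duration::from_secs(10)) {
        Ok(html) => Some(html),
        // Timed out: we return None (the caller falls back to plaintext), but
        // the spawned thread is still running and cannot be cancelled - only
        // a timeout checked inside syntect's own parsing loop can stop it.
        Err(_) => None,
    }
}

fn expensive_highlight(_code: &str, _extension: &str) -> String {
    // Stand-in for the real syntect call; may spin "forever" on bad input.
    String::from("<pre>...</pre>")
}
```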