Enable selective tracing with Jaeger and update Jaeger site config schema
Created by: beyang
Addresses #9300
From the updated CHANGELOG:
- Distributed tracing is a powerful tool for investigating performance issues. The following changes have been made with the goal of making it easier to use distributed tracing with Sourcegraph:
  - The site configuration field `"tracing.distributedTracing": { "sampling" }` allows a site admin to control which requests generate tracing data.
    - `"all"` will trace all requests.
    - `"selective"` will trace all requests initiated from an end-user URL with `?trace=1`. Non-end-user-initiated requests can set the HTTP header `X-Sourcegraph-Should-Trace: true`. This is the recommended setting, as `"all"` can generate large amounts of tracing data that may cause network and memory resource contention in the Sourcegraph instance.
    - `"none"` turns off tracing.
  - Jaeger is now the officially supported distributed tracer. The following is the recommended site configuration to connect Sourcegraph to a Jaeger agent (which must be deployed on the same host and listening on the default ports): `"tracing.distributedTracing": { "type": "jaeger", "sampling": "selective" }`
  - The site configuration field `useJaeger` is deprecated in favor of `"tracing.distributedTracing": { "type": "jaeger" }`.
  - The site configuration field `"experimentalFeatures": { "debug.log": { "opentracing" } }` toggles debug logging that logs every call initiated from the opentracing (Jaeger) client.
  - Support for configuring Lightstep as a distributed tracer is deprecated and will be removed in a subsequent release. Because most Sourcegraph instances are deployed on-prem and Lightstep is only available "in the Cloud", usage of Lightstep was very low or non-existent. If you are a paying customer and would like us to maintain support, please email [email protected].

Other notes:
- Reviewers should try this out in dev, given that Jaeger is now run by default in our dev environment. The main thing to try is to use the `"selective"` setting and toggle `?trace=1` in the URL to see Jaeger trace collection turn on and off for the given request tree.
- The diff touches many files because I had to update all invocations of the opentracing API to go through the `internal/trace/ot` package (which implements the "selective" tracing behavior described in the CHANGELOG).
- The field name `tracing.distributedTracing` was chosen because I anticipate wanting to add `tracing.nettrace` shortly. If anyone prefers a different naming scheme or site config structure, please comment.
TODO
- Investigating remaining issue with debug (toggle doesn't work)

Following merge, I will do the following:
- Update the Sourcegraph deployment repositories to have users install Jaeger by default. I will also provide instructions to deploy Jaeger for existing instances that do not already have it.
- Update docs.sourcegraph.com to document this on the site admin tracing docs page.
- Open a tech-debt issue to migrate to opentelemetry (which has a context-aware API for starting spans).
- Look into adding httptrace (for client-side tracing) to debug networking issues.