Server-side execution of src action exec
Created by: eseliger
Still super WIP
Things I've done so far:
- A very basic version of the remote runner that pulls jobs and runs them in docker, streaming logs to the
appendLog
mutation and finally the patch and status usingupdateActionJob
- GraphQL API definitions and stub API resolvers for all endpoints so the frontend can start again
-
actions
query resolver filled with logic - New DB tables
actions
,action_executions
,action_jobs
- Mock UI for a running action
Things this approach could serve (I tick off what is actually implemented, not just drafted out somehow):
-
An API schema that allows for everything described below -
A UI that shows all actions -
A UI that shows an overview of an action execution -
Saved searches can trigger action executions -
A schedule for action executions can be defined as a CRON pattern -
Uploading the workspace where the action lived on the local machine -
Updating an action while maintaining previous runs -
Streamed logs for all jobs, persisted individually for further inspection -
Collecting the patches in the backend, automatically creating a campaign plan for those upon execution completion. (This will allow for the "Create campaign" and "Update campaign" buttons in the screenshot) -
Attaching a campaign to an action, on completion, the campaign will be updated -
Job timeouts (runner heartbeats, timeout starved jobs) -
The possibility to still submit a campaign plan, when a subset of jobs failed -
UI for associating a saved search with an action -
UI for associating an action with a campaign
Things NOT included in this approach:
- Rerunning an action job while maintaining the old jobs, currently it would need to be cleared
- Cancellation of an execution is not implemented as part of the API/datamodel at this point
Tradeoffs:
- To keep the runner save, I opted for wrapping the "command" type in a docker container as well, this probably requires some more thought and I defaulted it to "ubuntu" for now, but should definitely be configurable at least.
- Running multiple jobs per repository is not possible, this is no regression over the previous version of actions, but to be considered in regards to monorepo support.
Learnings:
- It would probably not be super hard getting something out for 3.14 that includes a subset of the proposed features. The runner somewhat works, but needs better error handling, concurrency control and logging. The API needs more fill-ins for resolvers but they are just tedious to type out not super difficult (at least most of them), and the frontend needs to be implemented (currently a mock).
- Pulling jobs over having a complex Queue/Scheduler is probably perfectly fine for now, and a FIFO queue can be achieved by sorting the jobs ASC by ID
Note: the screenshot shows things that aren't implemented yet, but also shows thing not that are implemented, like log streaming and a proper diff view like known from the campaignplans page.
Update: gif
Known bugs:
- When a campaign is attached,
Creating changeset: error in GraphQL response: Base ref must be a branch
prevent the changesets from being created
Merge request reports
Activity
Created by: eseliger
Related work on src-cli: (that code is even worse than in here, I started learning Go on the src-cli repo
)Created by: eseliger
Follow-up to your review @mrnugget:
In this branch I implemented a prototype of running actions on multiple machines.
Actions, as we know them from
src-cli
already, do have the problem that they, in their current implementation, don't scale too well for very large campaigns. Therefore, I went ahead writing a prototype that adds Sourcegraph into the process of generating those patches, while maintaing the previous means of creating campaigns, because it's not always beneficial to run them via Sourcegraph (debugging an action locally, running a small campaign on a local machine).For that I introduced these entities into the system:
-
Action
, which represents one definition of an action to run, it holds the JSON defining thesteps
, and the associations to saved searches and campaigns. -
ActionExecution
, which is one instance of asrc actions exec
. It belongs to an action and inherits it's currentsteps
definition. For an action, multiple executions can be triggered (a saved search triggers, a user manually wants to run the action again to update his campaign, ...). The parent action itself can be updated to fix issues, so across executions the definition may change, which is why it's saved "redundantly". -
ActionJob
, which is a single run of asteps
definition over a single repository. What is known ofsrc-cli
as running the steps over many repositories, anActionJob
represents a single run over a repo, so allactionJobs
of an execution represent what was previously onesrc actions exec
command.
They relate to each other roughly like this:
Action 1--N ActionExecution 1--N ActionJob
- A user creates an action (currently via the UI, a
src actions exec -remote
flag is to be added) - In the background, the action is created and therefrom, one execution is triggered.
- Upon execution creation, the scope query from the saved config is evaluated and one job for each repo is created and attached to the execution.
- Whenever a
src actions runner
is free to accept a job, it callspullActionJob
on the API, which, if any pending jobs exist, marks the job as running and returns all information needed for executing the definedsteps
. - Every 5 seconds, the logs buffer from the
src actions runner
is streamed to the API usingappendLog
, saving it to theactionJob
. Those logs also appear in the UI for further inspection. - When a runner finished a job successfully, it calls
updateActionJob
, sets the status to completed and delivers any patches, should there have been some created. - The
updateActionJob
handler checks if the action job was the last to finish. If so, the execution itself has finished and is ready for further processing. - The
createCampaignPlanFromPatches
is invoked with the set of all patches for all jobs that were part of this execution. - Optional: If the parent
action
has acampaign
associated, it is automatically updated with the newly generated plan.
Other optionals:
- If an action_job failed, it is possible to try it by resetting it to the
pending
state. A runner will pick it up and, since the execution completed again, a new campaign plan will be created with the updated patches, should it have passed this time.
Alternative entrypoints:
- The query-runner runs over the saved searches. If one triggers, the frontend is called at
createActionExecutionsForSavedSearch
. - The frontend checks if any
Action
is associated to that saved search, if so, a new action execution gets created with the invokation_reason saved search, in the UI that will appear as "This execution ran because saved search changed it's results". - as above
(Not yet implemented):
- A periodic tasks checks the
actions.schedule
column in the DB if any scheduled executions should be triggered. - An execution is created for those actions.
- as above
- Job timeouts: We currently store runner_seen_at to use that as a kind of heartbeat. A periodic check should go by and mark those that were a long time without a heartbeat from a runner as canceled.
- Job cancellation:
Cancellation is supported in the frontend, but the src runner has to check that it's still a running task upon receiving the result of
appendLog
(which is also our heartbeat)
-
Created by: slimsag
Please avoid using the term 'remote code execution' as it's often associated with security vulnerabilities and e.g. I gasped when I got the email notification for this :)
Was there an RFC for this that I missed or is this the RFC itself so-to-speak? Where is the best place to leave feedback/questions/concerns?
Created by: eseliger
hey there this was a weekend project just to try out how much effort it would be to implement such a feature, so we didn't yet come up with an RFC unfortunately. I just keep this branch up to date from time to time when we do larger code changes to upstream, but we aim to get a gql schema in for 3.15 and the actual implementation for it for 3.16-ish (depends on workload coming up this iteration). Any feedback is super welcome and I think the best place for it would be on this PR directly at this point, until we decide the moment has come to complete this and I can sketch out the idea in an RFC. :)
Created by: slimsag
side note:
query-runner
is actually something I would like to remove entirely and put into the frontend as a singleton service running inside of there instead. Saved searches (which is what query-runner provides) is actually 100% managed by the frontend so query-runner is basically a dumb while loop with a sleep statement that performs all actions via thefrontend
's internal HTTP API: getting the defined saved searches, running them, sending email or slack notifications, everything is done through the frontend. So it's kinda crazy to have it in a separatequery-runner
service.Given that, you should consider from first-principles if it makes sense to have campaign running logic scheduled in a separate service as if query-runner did not exist or if it should also be done as a singleton service or via DB-based scheduling within the frontend monolith. I suspect probably the latter makes more sense given this information.
Created by: mrnugget
I suspect a lot of things will change once we go from this PR to RFC to actual implementation, so I'll defer on getting into details, but I wanted to point out some high-level things that might make it easier to place this PR and understand what it is and isn't:
- "Remote code execution": ignore the name, think of it as "server-side execution of
src action exec
", which is what this is about. - I wouldn't even say it's "arbitrary container execution" since the containers/commands that are executed are provided by the customers themselves. It's "arbitrary" from our perspective, but not from theirs.
- The main idea is: move
src action exec
from the local machine to a beefier machine and control it from the Sourcegraph UI.
Think: buildkite-agent but for src-action-exec.
Right now
src action exec
simply shells out to either execute adocker run
or to run a shell command (provided by the user!). The idea is to keep the same execution model, but move it to a different machine that connects to Sourcegraph and receives "jobs" from it.The big question of what
docker
is being called is orthogonal to the changes in this PR, I think.Our idea was that customers can just use any machine they want and launch
src action-agent
(or something like that). If they do that in a DIND environment or a separate machine doesn't matter, as long assrc action-agent
can (1) receive jobs from Sourcegraph (2) shell out and calldocker
.- "Remote code execution": ignore the name, think of it as "server-side execution of
Created by: slimsag
"Remote code execution": ignore the name, think of it as "server-side execution of src action exec", which is what this is about.
I understand I myself can ignore the name and that this is just about server-side execution of campaigns. My point here is that we should avoid putting 'remote code execution' into anything (pull requests, issues, docs) unless it's a security vulnerability. As an example, I do not want someone searching for "Sourcegraph remote code" to turn up results for "Spike on remote code execution" in Google because that looks scary. It's not a huge deal, but good to avoid in general is my point. Any objection to renaming the PR "server-side execution of src action exec" ?
Created by: codecov[bot]
Codecov Report
Merging #8574 into master will decrease coverage by
1.36%
. The diff coverage is20.44%
.@@ Coverage Diff @@ ## master #8574 +/- ## ========================================== - Coverage 44.26% 42.90% -1.37% ========================================== Files 1411 1399 -12 Lines 77383 77679 +296 Branches 6959 6794 -165 ========================================== - Hits 34257 33331 -926 - Misses 39907 41320 +1413 + Partials 3219 3028 -191
Flag Coverage Δ #go ?
#typescript ?
#unit 42.90% <20.44%> (-1.37%)
Impacted Files Coverage Δ cmd/frontend/graphqlbackend/campaigns.go 0.00% <0.00%> (ø)
cmd/query-runner/graphql.go 0.00% <0.00%> (ø)
cmd/query-runner/main.go 0.00% <0.00%> (ø)
...internal/campaigns/resolvers/action_definitions.go 0.00% <0.00%> (ø)
.../internal/campaigns/resolvers/action_executions.go 0.00% <0.00%> (ø)
...rprise/internal/campaigns/resolvers/action_jobs.go 0.00% <0.00%> (ø)
enterprise/internal/campaigns/resolvers/actions.go 0.00% <0.00%> (ø)
enterprise/internal/campaigns/resolvers/agents.go 0.00% <0.00%> (ø)
enterprise/internal/campaigns/workers.go 49.14% <ø> (-0.28%)
internal/campaigns/types.go 9.80% <0.00%> (-1.97%)
... and 418 more