Skip to content
Snippets Groups Projects

Server-side execution of src action exec

Closed Administrator requested to merge campaigns/remote-code into main

Created by: eseliger

Still super WIP

Things I've done so far:

  • A very basic version of the remote runner that pulls jobs and runs them in docker, streaming logs to the appendLog mutation and finally the patch and status using updateActionJob
  • GraphQL API definitions and stub API resolvers for all endpoints so the frontend can start again
  • actions query resolver filled with logic
  • New DB tables actions, action_executions, action_jobs
  • Mock UI for a running action

Things this approach could serve (I tick off what is actually implemented, not just drafted out somehow):

  • An API schema that allows for everything described below
  • A UI that shows all actions
  • A UI that shows an overview of an action execution
  • Saved searches can trigger action executions
  • A schedule for action executions can be defined as a CRON pattern
  • Uploading the workspace where the action lived on the local machine
  • Updating an action while maintaining previous runs
  • Streamed logs for all jobs, persisted individually for further inspection
  • Collecting the patches in the backend, automatically creating a campaign plan for those upon execution completion. (This will allow for the "Create campaign" and "Update campaign" buttons in the screenshot)
  • Attaching a campaign to an action, on completion, the campaign will be updated
  • Job timeouts (runner heartbeats, timeout starved jobs)
  • The possibility to still submit a campaign plan, when a subset of jobs failed
  • UI for associating a saved search with an action
  • UI for associating an action with a campaign

Things NOT included in this approach:

  • Rerunning an action job while maintaining the old jobs, currently it would need to be cleared
  • Cancellation of an execution is not implemented as part of the API/datamodel at this point

Tradeoffs:

  • To keep the runner save, I opted for wrapping the "command" type in a docker container as well, this probably requires some more thought and I defaulted it to "ubuntu" for now, but should definitely be configurable at least.
  • Running multiple jobs per repository is not possible, this is no regression over the previous version of actions, but to be considered in regards to monorepo support.

Learnings:

  • It would probably not be super hard getting something out for 3.14 that includes a subset of the proposed features. The runner somewhat works, but needs better error handling, concurrency control and logging. The API needs more fill-ins for resolvers but they are just tedious to type out not super difficult (at least most of them), and the frontend needs to be implemented (currently a mock).
  • Pulling jobs over having a complex Queue/Scheduler is probably perfectly fine for now, and a FIFO queue can be achieved by sorting the jobs ASC by ID

Note: the screenshot shows things that aren't implemented yet, but also shows thing not that are implemented, like log streaming and a proper diff view like known from the campaignplans page.

image

Update: gif

demo

Known bugs:

  • When a campaign is attached, Creating changeset: error in GraphQL response: Base ref must be a branch prevent the changesets from being created

Merge request reports

Approval is optional

Closed by avatar (Jul 4, 2025 1:12pm UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Created by: eseliger

    Related work on src-cli: (that code is even worse than in here, I started learning Go on the src-cli repo :grinning: )

    https://github.com/sourcegraph/src-cli/pull/139

  • Created by: eseliger

    Follow-up to your review @mrnugget:

    In this branch I implemented a prototype of running actions on multiple machines.

    Actions, as we know them from src-cli already, do have the problem that they, in their current implementation, don't scale too well for very large campaigns. Therefore, I went ahead writing a prototype that adds Sourcegraph into the process of generating those patches, while maintaing the previous means of creating campaigns, because it's not always beneficial to run them via Sourcegraph (debugging an action locally, running a small campaign on a local machine).

    For that I introduced these entities into the system:

    • Action, which represents one definition of an action to run, it holds the JSON defining the steps, and the associations to saved searches and campaigns.
    • ActionExecution, which is one instance of a src actions exec. It belongs to an action and inherits it's current steps definition. For an action, multiple executions can be triggered (a saved search triggers, a user manually wants to run the action again to update his campaign, ...). The parent action itself can be updated to fix issues, so across executions the definition may change, which is why it's saved "redundantly".
    • ActionJob, which is a single run of a steps definition over a single repository. What is known of src-cli as running the steps over many repositories, an ActionJob represents a single run over a repo, so all actionJobs of an execution represent what was previously one src actions exec command.

    They relate to each other roughly like this:

    Action 1--N ActionExecution 1--N ActionJob
    1. A user creates an action (currently via the UI, a src actions exec -remote flag is to be added)
    2. In the background, the action is created and therefrom, one execution is triggered.
    3. Upon execution creation, the scope query from the saved config is evaluated and one job for each repo is created and attached to the execution.
    4. Whenever a src actions runner is free to accept a job, it calls pullActionJob on the API, which, if any pending jobs exist, marks the job as running and returns all information needed for executing the defined steps.
    5. Every 5 seconds, the logs buffer from the src actions runner is streamed to the API using appendLog, saving it to the actionJob. Those logs also appear in the UI for further inspection.
    6. When a runner finished a job successfully, it calls updateActionJob, sets the status to completed and delivers any patches, should there have been some created.
    7. The updateActionJob handler checks if the action job was the last to finish. If so, the execution itself has finished and is ready for further processing.
    8. The createCampaignPlanFromPatches is invoked with the set of all patches for all jobs that were part of this execution.
    9. Optional: If the parent action has a campaign associated, it is automatically updated with the newly generated plan.

    Other optionals:

    • If an action_job failed, it is possible to try it by resetting it to the pending state. A runner will pick it up and, since the execution completed again, a new campaign plan will be created with the updated patches, should it have passed this time.

    Alternative entrypoints:

    1. The query-runner runs over the saved searches. If one triggers, the frontend is called at createActionExecutionsForSavedSearch.
    2. The frontend checks if any Action is associated to that saved search, if so, a new action execution gets created with the invokation_reason saved search, in the UI that will appear as "This execution ran because saved search changed it's results".
    3. as above

    (Not yet implemented):

    1. A periodic tasks checks the actions.schedule column in the DB if any scheduled executions should be triggered.
    2. An execution is created for those actions.
    3. as above
    • Job timeouts: We currently store runner_seen_at to use that as a kind of heartbeat. A periodic check should go by and mark those that were a long time without a heartbeat from a runner as canceled.
    • Job cancellation: Cancellation is supported in the frontend, but the src runner has to check that it's still a running task upon receiving the result of appendLog (which is also our heartbeat)
  • Created by: mrnugget

    To reduce scope: would it be a lot of work to remove the parts that are related to scheduled actions and saved searches?

  • Created by: eseliger

    No, scheduled actions are actually just having a db field to store it rn but no other code around it and saved searches have 1 api endpoint and 1 change to the query-runner, both can be pulled out easily

  • Created by: slimsag

    Please avoid using the term 'remote code execution' as it's often associated with security vulnerabilities and e.g. I gasped when I got the email notification for this :)

    Was there an RFC for this that I missed or is this the RFC itself so-to-speak? Where is the best place to leave feedback/questions/concerns?

  • Created by: eseliger

    hey there this was a weekend project just to try out how much effort it would be to implement such a feature, so we didn't yet come up with an RFC unfortunately. I just keep this branch up to date from time to time when we do larger code changes to upstream, but we aim to get a gql schema in for 3.15 and the actual implementation for it for 3.16-ish (depends on workload coming up this iteration). Any feedback is super welcome and I think the best place for it would be on this PR directly at this point, until we decide the moment has come to complete this and I can sketch out the idea in an RFC. :)

  • Created by: slimsag

    side note: query-runner is actually something I would like to remove entirely and put into the frontend as a singleton service running inside of there instead. Saved searches (which is what query-runner provides) is actually 100% managed by the frontend so query-runner is basically a dumb while loop with a sleep statement that performs all actions via the frontend's internal HTTP API: getting the defined saved searches, running them, sending email or slack notifications, everything is done through the frontend. So it's kinda crazy to have it in a separate query-runner service.

    Given that, you should consider from first-principles if it makes sense to have campaign running logic scheduled in a separate service as if query-runner did not exist or if it should also be done as a singleton service or via DB-based scheduling within the frontend monolith. I suspect probably the latter makes more sense given this information.

  • Created by: mrnugget

    I suspect a lot of things will change once we go from this PR to RFC to actual implementation, so I'll defer on getting into details, but I wanted to point out some high-level things that might make it easier to place this PR and understand what it is and isn't:

    1. "Remote code execution": ignore the name, think of it as "server-side execution of src action exec", which is what this is about.
    2. I wouldn't even say it's "arbitrary container execution" since the containers/commands that are executed are provided by the customers themselves. It's "arbitrary" from our perspective, but not from theirs.
    3. The main idea is: move src action exec from the local machine to a beefier machine and control it from the Sourcegraph UI.

    Think: buildkite-agent but for src-action-exec.

    Right now src action exec simply shells out to either execute a docker run or to run a shell command (provided by the user!). The idea is to keep the same execution model, but move it to a different machine that connects to Sourcegraph and receives "jobs" from it.

    The big question of what docker is being called is orthogonal to the changes in this PR, I think.

    Our idea was that customers can just use any machine they want and launch src action-agent (or something like that). If they do that in a DIND environment or a separate machine doesn't matter, as long as src action-agent can (1) receive jobs from Sourcegraph (2) shell out and call docker.

  • Created by: slimsag

    "Remote code execution": ignore the name, think of it as "server-side execution of src action exec", which is what this is about.

    I understand I myself can ignore the name and that this is just about server-side execution of campaigns. My point here is that we should avoid putting 'remote code execution' into anything (pull requests, issues, docs) unless it's a security vulnerability. As an example, I do not want someone searching for "Sourcegraph remote code" to turn up results for "Spike on remote code execution" in Google because that looks scary. It's not a huge deal, but good to avoid in general is my point. Any objection to renaming the PR "server-side execution of src action exec" ?

  • Created by: eseliger

    absolutely not!

  • Created by: codecov[bot]

    Codecov Report

    Merging #8574 into master will decrease coverage by 1.36%. The diff coverage is 20.44%.

    @@            Coverage Diff             @@
    ##           master    #8574      +/-   ##
    ==========================================
    - Coverage   44.26%   42.90%   -1.37%     
    ==========================================
      Files        1411     1399      -12     
      Lines       77383    77679     +296     
      Branches     6959     6794     -165     
    ==========================================
    - Hits        34257    33331     -926     
    - Misses      39907    41320    +1413     
    + Partials     3219     3028     -191     
    Flag Coverage Δ
    #go ?
    #typescript ?
    #unit 42.90% <20.44%> (-1.37%) :arrow_down:
    Impacted Files Coverage Δ
    cmd/frontend/graphqlbackend/campaigns.go 0.00% <0.00%> (ø)
    cmd/query-runner/graphql.go 0.00% <0.00%> (ø)
    cmd/query-runner/main.go 0.00% <0.00%> (ø)
    ...internal/campaigns/resolvers/action_definitions.go 0.00% <0.00%> (ø)
    .../internal/campaigns/resolvers/action_executions.go 0.00% <0.00%> (ø)
    ...rprise/internal/campaigns/resolvers/action_jobs.go 0.00% <0.00%> (ø)
    enterprise/internal/campaigns/resolvers/actions.go 0.00% <0.00%> (ø)
    enterprise/internal/campaigns/resolvers/agents.go 0.00% <0.00%> (ø)
    enterprise/internal/campaigns/workers.go 49.14% <ø> (-0.28%) :arrow_down:
    internal/campaigns/types.go 9.80% <0.00%> (-1.97%) :arrow_down:
    ... and 418 more
  • Created by: chrispine

    Closing because this is out of date at this point. Once we look at server-side execution again, we can use this as a reference.

Please register or sign in to reply
Loading