Support executing terraform commands via k8s jobs #3791

george-zubrienko opened this issue Sep 25, 2023 · 13 comments

@george-zubrienko

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Describe the user story
I have seen #260 running for a while, with multiple attempts at resolving the issue of project locality in Atlantis. Our organization also has issues scaling Atlantis, as the single machine hosting it cannot cope with all the applies/plans coming in. There are also issues with provider cache sharing, version locking, plan storage, and resource usage when both the webserver and Terraform itself run in the same container/pod.

This can be partially worked around by installing several Atlantis charts, avoiding a monorepo, and configuring a new webhook for each, but that approach has scalability limits: the number of projects and the resources in them will eventually exceed the number of repos unless you adopt a repo-per-project layout.

I want to propose a solution to this, as an opt-in feature, that allows Atlantis to scale horizontally for organizations and teams larger than the current setup can realistically serve without installing several charts/webhooks.

Describe the solution you'd like
I propose an option to return a specialised CommandRunner (looking at this code piece). I think it should be feasible to run that code in two modes (a rough sketch follows the list):

  • current (default) mode, where the event controller creates a command runner
  • worker (new) mode, where the event controller spawns a k8s Job which:
    • uses the same image, or an image without the webserver component
    • executes the event handler logic using the arguments provided by the event controller
    • uses an RWX (ReadWriteMany) volume mount to save state such as generated plans
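
To make the two modes concrete, here is a minimal sketch of how a dispatching runner could look. CommandRunner, EventContext, and DispatchingRunner are simplified stand-ins for illustration, not the actual Atlantis types:

package runner

import "context"

// EventContext is a simplified stand-in for the data the event controller
// already resolves from a webhook (repo, PR number, command, flags).
type EventContext struct {
	Repo    string
	PullNum int
	Command string // "plan", "apply", ...
	Args    []string
}

// CommandRunner mirrors the idea of the existing runner interface.
type CommandRunner interface {
	Run(ctx context.Context, e EventContext) error
}

// DispatchingRunner picks local or remote execution based on config.
type DispatchingRunner struct {
	RemoteEnabled bool
	Local         CommandRunner // current in-process behaviour
	Remote        CommandRunner // spawns a k8s Job and waits for it
}

func (d *DispatchingRunner) Run(ctx context.Context, e EventContext) error {
	if d.RemoteEnabled {
		return d.Remote.Run(ctx, e)
	}
	return d.Local.Run(ctx, e)
}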

In values.yaml this could be something like:

remote-execution:
  enabled: true
  stateVolumeName: my-volume
  workerAffinity: {}
  workerTolerations: {}
  workerResources: {}
  workerLimits: {}

If enabled, this flag should also add a CRD (with its custom resources scoped to the namespace Atlantis is installed in):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: worker-jobs.atlantis.runatlantis.io
spec:
  group: atlantis.runatlantis.io
  scope: Namespaced
  names:
    plural: worker-jobs
    singular: worker-job
    kind: AtlantisWorker
    shortNames:
      - aw
  versions:
    - name: v1beta1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                metadata:
                  type: object
                  properties:
                    labels:
                      type: object
                      additionalProperties:
                        type: string
                      nullable: true
                    annotations:
                      type: object
                      additionalProperties:
                        type: string
                      nullable: true
                  description: Job metadata
                template:
                  type: object
                  x-kubernetes-embedded-resource: true
                  x-kubernetes-preserve-unknown-fields: true

This will allow storing the whole k8s Job template on the cluster. The event handling flow will then require an adjustment; I tried to capture it in a Mermaid diagram:

graph TD;  
    A[event] --> B{remote execution enabled?};  
    B -->|No| C[local runner];  
    B -->|Yes| X;  
    X --> D[Prepare EventContext];  
    X --> E[resolve CommandType];  
    X --> F[read job template in RemoteJob object];  
    D --> G[set `cmd` and `args` in container spec];  
    E --> G;  
    F --> G;  
    G --> H[Send generated RemoteJob to the cluster];  
    H --> I[Wait for RemoteJob to complete];  
    I -->|Receive HTTP POST from job| J[Job Completed];  
    I -->|Check status on cluster| J;  
    J --> K[Send a Webhook Event data];  

Note that with the proposed mode (adding a new CommandRunner), the worker will be responsible for VCS communication, since that class has a VcsClient, so explicitly checking the job result may not be needed at all.
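
A rough client-go sketch (not the actual Atlantis code) of the flow above: read the stored template from an AtlantisWorker object, set cmd/args on the first container, and submit the resulting Job. The function names, the resource name passed to readWorkerTemplate, and the wiring are assumptions for illustration:

package remote

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

var workerGVR = schema.GroupVersionResource{
	Group:    "atlantis.runatlantis.io",
	Version:  "v1beta1",
	Resource: "worker-jobs",
}

// readWorkerTemplate fetches the AtlantisWorker object and converts its
// spec.template field into a typed PodTemplateSpec.
func readWorkerTemplate(ctx context.Context, dyn dynamic.Interface, ns, name string) (corev1.PodTemplateSpec, error) {
	var tmpl corev1.PodTemplateSpec
	obj, err := dyn.Resource(workerGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return tmpl, err
	}
	raw, found, err := unstructured.NestedMap(obj.Object, "spec", "template")
	if err != nil || !found {
		return tmpl, fmt.Errorf("spec.template missing from %s/%s: %v", ns, name, err)
	}
	err = runtime.DefaultUnstructuredConverter.FromUnstructured(raw, &tmpl)
	return tmpl, err
}

// submitWorkerJob fills the first container with the resolved atlantis command
// and creates the Job on the cluster.
func submitWorkerJob(ctx context.Context, cs kubernetes.Interface, ns string,
	tmpl corev1.PodTemplateSpec, command string, args []string) (*batchv1.Job, error) {

	if len(tmpl.Spec.Containers) == 0 {
		return nil, fmt.Errorf("job template has no containers")
	}
	tmpl.Spec.Containers[0].Command = []string{command}
	tmpl.Spec.Containers[0].Args = args
	tmpl.Spec.RestartPolicy = corev1.RestartPolicyNever

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "atlantis-worker-", Namespace: ns},
		Spec:       batchv1.JobSpec{Template: tmpl},
	}
	return cs.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
}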

Describe the drawbacks of your solution
I do not see a lot of challenges in maintaining the k8s integration itself, as the Batch API has been very stable and has plenty of features Atlantis could make use of in the future; for example, suspend could be used for delayed applies. The only detail here is that every minor k8s release brings some exciting new functionality, so if Atlantis uses it, we'll be forced to maintain a k8s feature compatibility matrix and decide how we support different k8s versions and what is available depending on the version. However, the current architecture is not very friendly towards remote execution, and it would require some effort to add this feature so it can work alongside existing functionality, or work selectively based on the k8s version of the cluster Atlantis is deployed to.

Then, this adds a CRD to the chart, which comes with its own fun, like migrations. This could be avoided by moving the base template into the app config instead.
However, running an external runner requires an image, and regardless of the route chosen this adds maintenance. If we stick with a single image as now, we'll have to add support for a "cli" mode on top of the current "webserver" mode. If we go with two images, that adds a lot of chores to build and publish the second image. Plus, some people might want to run their own image, and they will open issues asking us to support that, so a bit of a Pandora's box here.

Last, but not least, running an external process in an environment like k8s always comes with the cost of investing in bookkeeping. What happens if the job fails to execute the command? How do we handle exit 137 or other special exit codes when the container might not be able to communicate its status gracefully? Most likely we'll need some sort of "garbage collector" (where I work, we call them "maintainers"): another app instance that handles these edge cases. Note this is not about removing jobs, as the TTL controller handles that no problem, but rather about situations where somebody runs atlantis plan and gets silence in return because the launched job crashed due to app misconfiguration etc.
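
As a sketch of what such a "garbage collector" might do, the following lists worker Jobs by label and surfaces the failed ones so the originating PR can still get feedback. The FailedWorkerJobs helper and the label name are hypothetical:

package remote

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// FailedWorkerJobs returns Jobs carrying the (assumed) worker label whose
// failed-pod count is non-zero, so a caller can report back to the PR.
func FailedWorkerJobs(ctx context.Context, cs kubernetes.Interface, ns string) ([]batchv1.Job, error) {
	list, err := cs.BatchV1().Jobs(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "atlantis.runatlantis.io/worker=true", // hypothetical label
	})
	if err != nil {
		return nil, err
	}
	var failed []batchv1.Job
	for _, j := range list.Items {
		if j.Status.Failed > 0 {
			failed = append(failed, j)
		}
	}
	return failed, nil
}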

Overall, I think all those are manageable, but no doubt this adds a new level of complexity to the app and will require more maintenance than before.

Describe alternatives you've considered
The alternatives will always revolve around either multithreading or running replicaCount > 1. I'd say the latter would be great if it's possible and easier to implement than k8s Jobs.

@george-zubrienko george-zubrienko added the feature New functionality/enhancement label Sep 25, 2023
@george-zubrienko
Author

fyi @s-vitaliy

@jamengual
Contributor

@george-zubrienko thanks for the detailed description; there is a lot to unpack here.

I would recommend looking at https://github.com/lyft/atlantis and their fork to get some ideas; they use Temporal as a scheduler/worker-queue system, so maybe there is something there that can be reused and upstreamed back to Atlantis.

Just a suggestion.

@jamengual jamengual added the needs discussion Large change that needs review from community/maintainers label Sep 25, 2023
@george-zubrienko
Author

george-zubrienko commented Sep 27, 2023

I'll take a look on the weekend and circle back here, thank you!

Another step from my end would be to propose concrete tasks to implement and to adjust the list following the discussion.

@jamengual
Contributor

Yes, in a way, to build a backlog.
We could use a roadmap for that and tag individual issues to it, but that will come after we agree on an architecture.

@WarpRat

WarpRat commented Oct 19, 2023

We actually did this ourselves recently, just using a few bash scripts and Redis for passing completed plans and command output back to Atlantis. It's been working well for us and has let us greatly reduce the footprint of our Atlantis pod. I had hoped to rewrite the bash in a small Go utility and publish the code somewhere public, but haven't had a chance with shifting priorities at work. I'd be happy to share some details of how we approached it, although it's not much more than a proof of concept currently. This would be a great feature to have available natively in Atlantis.

@jamengual
Contributor

jamengual commented Oct 19, 2023 via email

@george-zubrienko
Author

george-zubrienko commented Nov 20, 2023

@jamengual sorry for being out for a bit; I'm coming back to this and am going to increase my activity on this one until we can give the solution some shape. I took a break from OSS at the end of October, so I haven't looked at the Lyft stuff yet - will do shortly.

I have a small suggestion/question. I see several people have already been able to implement some sort of solution by adding a "proxy layer" between Atlantis and the VCS. So I had this idea: maybe it would be cheaper to add some sort of "workload manager" that looks like a VCS provider API to Atlantis, but in reality acts as a proxy? That would allow running multiple replicas of Atlantis itself, as long as the proxy can split the work between them.

This would be (potentially) easier to implement and fully opt-in: if the proxy is not enabled, people run vanilla mode. Also, this way the Atlantis core does not have to be changed at all, so a lot of the work involved in aligning changes with other commits is not required.

UPD: This is somewhat similar to Lyft's Gateway, in case what I wrote is confusing. However, I believe this can be simplified if we follow the Atlantis model (PR -> multiple jobs = TF runs) instead of Lyft's model (revisions -> queue -> TF run).

@jamengual
Contributor

Yes, that is a possibility. I guess getting to a PoC level and seeing how that could work will be good for understanding the whole flow.

@george-zubrienko
Author

george-zubrienko commented Nov 21, 2023

Alright, I'll try to conjure a PoC that will function roughly this way:

  • A separate app that receives GH webhook events and links + routes them to the chosen Atlantis pod (atlantis-0,1,...). If no pod is currently free, it will scale the stateful set up to n+1 replicas and route the new commit to the new one.
  • The app will maintain state of, roughly, PR-Commit-TFCommand-Status-Atlantis Pod and update it based on Atlantis-initiated pushes when commands finish execution. This way it should not need a queue, as it relies on exclusive "locking" of the pods that serve each PR's commits (see the sketch after this list). A colleague of mine also suggested we could use a CRD that lives while the PR is active and contains the mapping of which Atlantis pod serves which PR. That way we don't need any additional storage integration.
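
A minimal sketch of that exclusive pod "locking"; PodRegistry, PullRequestKey, and the scale-up-on-miss behaviour are assumptions for illustration, not code from the draft:

package proxy

import "sync"

// PullRequestKey identifies a PR across repos.
type PullRequestKey struct {
	Repo string
	Num  int
}

// PodRegistry tracks which Atlantis pod currently serves which PR.
type PodRegistry struct {
	mu       sync.Mutex
	byPR     map[PullRequestKey]string // PR -> pod name, e.g. "atlantis-2"
	assigned map[string]bool           // pod name -> busy?
}

func NewPodRegistry() *PodRegistry {
	return &PodRegistry{
		byPR:     map[PullRequestKey]string{},
		assigned: map[string]bool{},
	}
}

// Assign returns the pod already serving the PR, or claims a free pod from
// the candidate list; ok is false if every candidate is busy (the caller
// would then scale the StatefulSet to n+1 and retry).
func (r *PodRegistry) Assign(pr PullRequestKey, candidates []string) (pod string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if pod, ok := r.byPR[pr]; ok {
		return pod, true
	}
	for _, c := range candidates {
		if !r.assigned[c] {
			r.byPR[pr] = c
			r.assigned[c] = true
			return c, true
		}
	}
	return "", false
}

// Release frees the pod once all commands for the PR have finished.
func (r *PodRegistry) Release(pr PullRequestKey) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if pod, ok := r.byPR[pr]; ok {
		delete(r.byPR, pr)
		delete(r.assigned, pod)
	}
}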

It will take a couple of weeks to come up with a prototype, and I have a bit of vacation at the end of December, so I hope I can present something in Jan 2024 :)

@george-zubrienko
Author

george-zubrienko commented Dec 19, 2023

@jamengual please take a look at a rough draft: main...SneaksAndData:atlantis:gateway. Note this is my first Go project to work on, so if some stuff seems fishy, please point it out :)

Also, as a disclaimer, this has by no means been tested yet, nor is it a complete version. I'm sending a diff just to probe whether the idea resonates with the contributors/community well enough before starting any e2e tests.

This requires one more PR to the Atlantis Helm chart to change the ingress. TL;DR, the implementation is as described below.

Atlantis Job Mode

An optional deployment option of the Helm chart. It does not affect or modify any Atlantis code. The diff above contains code for the GitHub VCS only for now.

Job mode changes

Should be enabled via jobMode: enabled in the Helm values. For now I just provide an example of how the template will look and the PVC it needs. Enabling it changes the following:

  • The StatefulSet is not deployed
  • A ConfigMap containing the pod spec for the Atlantis Server is deployed
  • The Atlantis data volume is changed to RWX (ReadWriteMany) instead of RWO (EFS, Azure Files, other NFS-like services)
  • A new deployment, Atlantis Proxy, is added
  • The ingress for the Atlantis Server is not deployed in this mode
  • A new ingress for the Atlantis Proxy is deployed, which routes VCS events to the respective pods
  • The Atlantis Proxy associates a PR with a k8s Job using labels/annotations. I have not decided on the locate-pod approach (you can see from the code that I also use names directly, but that might be a bad idea given those come from PR URLs). It has a single purpose: finding out which Job -> Pod -> IP runs the Atlantis Server that serves the PR. If one exists, it does a POST to the target IP's /events endpoint:
// Forwards the incoming webhook body as-is to the /events endpoint of the pod
// already associated with the pull request.
func (ers *DefaultEventRoutingService) RoutePullRequest(target models.PullRequestAssociation, webhookRequest *http.Request) (resp *http.Response, err error) {
	return http.Post(target.AssociatedHost().EventsUrl(), "application/json", webhookRequest.Body)
}

If not, it will create a new Job with the Atlantis Server and the label/annotation, wait for the pod to come up, and route the event there. All events are posted to a channel owned by that host and processed sequentially (assuming I get this part of Go correctly :) ).
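
A rough sketch of that per-host sequential processing; queuedEvent and startHostWorker are assumed names, not the draft's actual types:

package proxy

import (
	"bytes"
	"log"
	"net/http"
)

type queuedEvent struct {
	url  string // target pod's /events endpoint
	body []byte
}

// startHostWorker drains the channel for a single Atlantis host, forwarding
// events one at a time so they are processed in order.
func startHostWorker(events <-chan queuedEvent) {
	go func() {
		for ev := range events {
			resp, err := http.Post(ev.url, "application/json", bytes.NewReader(ev.body))
			if err != nil {
				log.Printf("forwarding event to %s failed: %v", ev.url, err)
				continue
			}
			resp.Body.Close()
		}
	}()
}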

In addition, each Job binds its atlantis-data dir to a dedicated path on a file share. As you can see in the Job example, I've set the deadline to 24 hours in order to conserve hosts that serve long-living PRs. Storing data on a file share allows recovering the PR on another host - again, assuming I understood the core code correctly.

Job mode capabilities

The way it is implemented, the only thing job mode does is provide a way to handle multiple PRs in multiple repos concurrently, without running into provider code lock issues or the performance issues of a single Atlantis host.

This implementation does not provide any queuing capabilities, so the deployment is still subject to state file lock conflicts if multiple PRs target the same TF dir.

@jamengual
Contributor

I'm going on PTO tomorrow, so I will be on and off looking at things, but @GenPage @chenrui333 @nitrocode can review this too and have more experience in k8s than me.

@pseudomorph
Contributor

I've only skimmed the entirety of this, so I may be missing the whole context. But, would it make more sense to build a remote execution framework (akin to Terraform Cloud worker nodes) which is not strictly tied to kube and build the kube glue around that?

Just in case there are others who might be wanting to use a different compute platform, but achieve the same results.

Apologies if I'm way off base here.

@george-zubrienko
Author

@pseudomorph that's a reasonable suggestion. My main reason for going this way is that it requires less effort but still covers the majority of Atlantis installations - those in kube. The proxy is not tied to Atlantis as such and serves just as a request relay layer to work around Atlantis's horizontal scalability issues. I'd consider this a "v0" implementation that just lets people who install Atlantis in their clusters via Helm get a bit more scalability out of the box, if they need it.

Longer term, a "remote execution framework", either homegrown or imported, would be the more appropriate solution so non-kube cases can be covered as well (Nomad? :D). I'm a total kube-brain and I'm trying to solve our internal issue, but I also feel like we can contribute it upstream.

I now have a strict deadline for this issue in my work plan (end of March 2024), so I'll be doing some e2e testing of this one soon.

@dosubot dosubot bot added the Stale label Oct 9, 2024