Introducing break the glass as a principle #38

Closed
wants to merge 4 commits into from

@grmhay commented Oct 12, 2021

We (Morgan Stanley) believe that the situation where the source of truth for desired state (e.g. github.com or a git-equivalent that an enterprise may run) is less available than your users' expected SLA for making configuration changes is being left by the community as an issue for the implementer to overcome.
Put succinctly, if GitHub is unavailable and you want to make changes to your System State, there should be one approach and a set of tooling to allow reconciliation after the fact.
This both harms adoption of GitOps and is inefficient, as I believe we share a common challenge that we can solve once within the project.
The first step, as this project has so well established, is a glossary of terms to allow us to describe the problem and a draft principle to add. I have included these in this PR.

Comment on lines +24 to +26
5. **Manageable "always"**

Desired state is able to be updated according to users' SLA expectations to update system state, even if the "source" is unavailable.

@lloydchang (Contributor) commented Oct 13, 2021

@grmhay Thank you for the pull request.

Sorry if I'm misunderstanding you but the PRINCIPLES.md section of this pull request...

Manageable "always"

Desired state is able to be updated according to users' SLA expectations to update system state, even if the "source" is unavailable.

... seems to assume that the source should be centrally managed or always managed with SLA expectations.

In the scenario that you described, it seems that GitHub is used for GitOps:

We (Morgan Stanley) believe that the situation where the source of truth for desired state (e.g. github.com or a git-equivalent that an enterprise may run) is less available than your users' expected SLA for making configuration changes is being left by the community as an issue for the implementer to overcome.
Put succinctly, if GitHub is unavailable and you want to make changes to your System State, there should be one approach and a set of tooling to allow reconciliation after the fact.
This both harms adoption of GitOps and is inefficient, as I believe we share a common challenge that we can solve once within the project.
The first step, as this project has so well established, is a glossary of terms to allow us to describe the problem and a draft principle to add. I have included these in this PR.

My concern is that the proposed principle, as written, seems to presuppose GitOps only running as a centralized system and always managed with an SLA.

While GitHub can be centrally managed with an SLA, Git isn't centrally managed at all.

The proposed principle, as written, seems to exclude non-centralized usages of GitOps, Git, Kubernetes, etc.

While GitOps doesn't require Git, I am listing Git below because you referenced Git earlier...

• Git, by design, is a distributed revision control system (DVCS), and not managed as a centralized system

Since we are discussing principles, which need to be applicable in many scenarios, centralized management wouldn't work in disconnected scenarios, such as:

• Kubernetes on fighter jets, e.g. https://www.cncf.io/blog/2021/09/30/how-to-get-robust-gitops-the-u-s-department-of-defense-uses-flux-and-helm/

• Kubernetes at in-store point of sales systems, e.g. https://www.cncf.io/blog/2021/02/19/how-a-4-billion-retailer-built-an-enterprise-ready-kubernetes-platform-powered-by-linkerd/

• Kubernetes in air-gapped environments, e.g. https://github.com/cncf/cnf-testsuite/blob/main/AIRGAP.md

• Kubernetes at the edge, e.g. https://www.cncf.io/blog/2021/05/04/kubernetes-at-the-edge-organizations-are-using-edge-technologies-but-there-is-room-to-grow/

While GitOps doesn't require Kubernetes, I listed Kubernetes in links above because Kubernetes is a CNCF project.

@grmhay (Author)

Hi @lloydchang. I appreciate your feedback, and apologies for the delay in replying: KubeCon, then a couple of days off. I work in a large enterprise without disconnected scenarios, so it is great to collaborate with someone who has a different perspective! Reflecting on principle #3, "Software agents automatically pull the desired state declarations from the source": our problem arises when the desired state in the "source" on the "state store" (usage of terms I believe per the Glossary) is less available than the SLA users expect for changing the desired state of the "Software System".

Judging by most of the answers I got at GitOpsCon when I put this question to end-user organization presenters, this problem is either ignored ("well, if Git/GitLab/... is down, we can't make cluster changes") or unsolved, and I believe that will end up in a bad place for GitOps.

Actually, with your example of the disconnected scenario, doesn't the problem I outline for the enterprise become even more acute? What happens if you want to update the desired state of a Kubernetes cluster (an example software system) but the "state store" is unavailable (e.g. the WAN connection to a branch office holding the cluster is down)? Either you just can't change the cluster config, or you break glass and change the cluster config, and then you are left to manually reconcile the desired state expression on the "state store" with what is on your cluster.
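
The "reconcile after the fact" step described above can be illustrated with a minimal sketch. It assumes the desired and live state can each be flattened into a simple key/value mapping, which real manifests generally cannot. The function name `drift` and the `set`/`remove` result shape are hypothetical and do not come from the GitOps principles or any tool.

```python
from typing import Any, Dict

# Sentinel so that a key missing from `desired` never compares equal to a live value.
_MISSING = object()


def drift(desired: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the edits that would make `desired` match `live`.

    Hypothetical helper: after a break-glass change made directly on the
    system, these are the edits to write back to the state store once it
    is reachable again.
    """
    changes: Dict[str, Dict[str, Any]] = {"set": {}, "remove": {}}
    # Keys that were added or changed on the live system.
    for key, value in live.items():
        if desired.get(key, _MISSING) != value:
            changes["set"][key] = value
    # Keys that were deleted from the live system.
    for key, value in desired.items():
        if key not in live:
            changes["remove"][key] = value
    return changes


if __name__ == "__main__":
    desired = {"replicas": 3, "image": "app:1.0", "debug": False}
    live = {"replicas": 5, "image": "app:1.0"}  # edited by hand during the outage
    print(drift(desired, live))
```

The sketch only reports the difference. Deciding whether the manual change should be kept or reverted remains a human decision.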

Note: I also have to fix my commits to add the DCO sign-off, so I'll amend them based on your feedback. Please continue the conversation on my new PR.

@lloydchang (Contributor)

@grmhay Generally, re: DCO, I found the following useful. Thank you!

@scottrigby wrote in gitops-working-group/gitops-working-group#117

just need to amend your commit to add your DCO signoff

PS, you may be interested in this fun plugin I made to help with this when using the GitHub UI 😸 https://github.com/scottrigby/dco-gh-ui/

I made this because this issue eventually happens to almost everyone contributing to CNCF/LF projects that require DCO.

Then, specifically to this #38

From https://github.com/open-gitops/documents/pull/38/checks?check_run_id=3876293036

DCO

You have 4 commits incorrectly signed off. To fix, first ensure you have a local copy of your branch by checking out the pull request locally via command line. Next, head to your local branch and run:

git rebase HEAD~4 --signoff

Now your commits will have your sign off. Next run:

git push --force-with-lease origin main

Commit sha: ff2d6ff, Author: Graeme Hay, Committer: GitHub; The sign-off is missing.
Commit sha: 3f3b575, Author: Graeme Hay, Committer: GitHub; The sign-off is missing.
Commit sha: 1181a09, Author: Graeme Hay, Committer: GitHub; The sign-off is missing.
Commit sha: a6b12ea, Author: Graeme Hay, Committer: GitHub; The sign-off is missing.
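
As an illustration of what the check above looks for, here is a minimal sketch that scans commit messages for a `Signed-off-by:` trailer line, which is what `git commit --signoff` and `git rebase --signoff` append. The function name `missing_signoff` is hypothetical. The sketch assumes the messages have already been collected as strings, and the actual DCO check may apply stricter rules than this.

```python
from typing import List


def missing_signoff(messages: List[str]) -> List[int]:
    """Return the indexes of commit messages that have no Signed-off-by line.

    Simplified, hypothetical check: it only looks for a line that starts
    with "Signed-off-by: " and does not validate the name or e-mail address.
    """
    missing = []
    for index, message in enumerate(messages):
        lines = [line.strip() for line in message.splitlines()]
        if not any(line.startswith("Signed-off-by: ") for line in lines):
            missing.append(index)
    return missing


if __name__ == "__main__":
    messages = [
        "Add draft principle\n\nSigned-off-by: A Developer <dev@example.com>",
        "Update glossary",  # no sign-off, so this one would be flagged
    ]
    print(missing_signoff(messages))
```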

@lloydchang (Contributor)

TL;DR: Break Glass was already added in RC1 draft at #21, then subsequently removed in RC2 draft at #22.


Thank you @grmhay for your pull request. I hope my feedback is useful.

It appears the topics "break the glass" and "Break Glass" were discussed in past pull requests #21 and #22, and in meetings on July 28th, July 7th, and May 19th.

Thank you for your patience: the meeting recordings after May 5th haven't been uploaded yet. Below are more details:


From @todaywasawesome at #22

Two of the larger items were to remove "Break glass" and its definition.

Break Glass
The temporary suspension of GitOps principles, often accomplished by pausing automated reconciliation. While these principles apply to typical operations, it may at times be necessary to temporarily pause reconciliation, for example during incident management activities. In these cases, other modes of operations should be considered (e.g. manual intervention), followed by any necessary updates to the desired state declarations, and finally resuming reconciliation of the system with the updated declarations. Pragmatic exceptions to these guiding principles are expected from time to time during the journey toward a system being fully managed by GitOps.

In the case of break glass, this is a term that was in a previous version of the principles. The discussion of when it's ok to break glass is a large one and something that is likely better addressed in a whitepaper or a blog post. It's such an important topic that having it backended into the glossary felt like too big of an item and viewpoint to be added there.
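
The sequence in the removed "Break Glass" definition (pause reconciliation, intervene manually, update the declarations, resume) can be sketched as a toy model. This is a minimal illustration under stated assumptions, not how any real agent is implemented: the `Reconciler` class, its `paused` flag, and `break_glass_demo` are all hypothetical names, and state is modeled as a flat dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Reconciler:
    """Toy agent that copies desired state onto live state unless paused."""

    desired: Dict[str, str]
    live: Dict[str, str] = field(default_factory=dict)
    paused: bool = False

    def reconcile(self) -> bool:
        """Apply desired state to the live system; return True if applied."""
        if self.paused:
            return False
        self.live = dict(self.desired)
        return True


def break_glass_demo() -> Dict[str, str]:
    agent = Reconciler(desired={"replicas": "3"})
    agent.reconcile()                  # normal operation: live follows desired

    agent.paused = True                # 1. pause automated reconciliation
    agent.live["replicas"] = "5"       # 2. manual intervention on the system
    assert agent.reconcile() is False  #    the manual change is not reverted

    agent.desired["replicas"] = "5"    # 3. update the desired state declarations
    agent.paused = False               # 4. resume reconciliation
    agent.reconcile()
    return agent.live


if __name__ == "__main__":
    print(break_glass_demo())
```

If step 3 were skipped, resuming reconciliation would overwrite the manual change with the old declaration, which is why the definition puts the update before the resume.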


Relatedly, the items below are from the meeting notes at https://docs.google.com/document/d/1hxifmCdOV5_FbKloDJRWZQHq0ge-trXJKF-BgV4wHVk/edit


July 28, 2021 - Principles weekly

https://docs.google.com/document/d/1hxifmCdOV5_FbKloDJRWZQHq0ge-trXJKF-BgV4wHVk/edit#heading=h.78a62j9lu1di


July 07, 2021 - Principles Weekly

  • Action items:
    • Break Glass definition

https://docs.google.com/document/d/1hxifmCdOV5_FbKloDJRWZQHq0ge-trXJKF-BgV4wHVk/edit#heading=h.x64w5rcdr8yl


May 19, 2021 - Principles Weekly

  • Topics:

    • Next steps
      • Additional notes
        • When to break the glass on GitOps
  • Action items:

    • @bricef Brice: When to break the glass on GitOps

https://docs.google.com/document/d/1hxifmCdOV5_FbKloDJRWZQHq0ge-trXJKF-BgV4wHVk/edit#heading=h.rkp18j3m2810


Meeting recordings are uploaded to https://www.youtube.com/channel/UCI6iqYuuI4gZuOCZaks5i1g/videos

Status of meeting recordings: My understanding from @scottrigby via #wg-gitops Slack channel at https://cloud-native.slack.com/archives/C01G9DEE88M/p1634261795135100 is:

Recordings are in the process of being uploaded to YouTube. @chris-short has generously been volunteering to upload those, and @scottrigby thinks @chris-short may be waiting on @scottrigby to verify them. Maybe this can be split up between multiple people on the media team. Apologies for the delay

Thank you @chris-short and @scottrigby for taking the time to upload the meeting recordings.

@grmhay closed this Oct 20, 2021

@grmhay (Author) commented Oct 20, 2021

Will open a new PR now I know what a DCO signoff is! Sorry...
