Add "weaknesses found" analytic to chaos hub #4517
Replies: 3 comments
-
Experiments can also fail because of configuration issues in the experiment itself. We need to filter those out from being counted as finding a weakness.
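A minimal sketch of that filtering idea, assuming ChaosResult-style records where the failure cause is available (the `failure_reason` field and its values here are hypothetical, not an actual Litmus field):

```python
# Sketch: count a failed run as a weakness only when the failure was not
# caused by the experiment's own configuration.
# `failure_reason` and its values are hypothetical, not a real Litmus field.

CONFIG_FAILURE_REASONS = {"invalid-experiment-spec", "missing-rbac", "image-pull-error"}

def is_weakness(result: dict) -> bool:
    if result.get("verdict") != "Fail":
        return False
    return result.get("failure_reason") not in CONFIG_FAILURE_REASONS

results = [
    {"experiment": "pod-delete", "verdict": "Fail", "failure_reason": "probe-failed"},
    {"experiment": "pod-delete", "verdict": "Fail", "failure_reason": "missing-rbac"},
]
print(sum(is_weakness(r) for r in results))  # -> 1; the misconfigured run is excluded
```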
-
This is a huge, very complex thought, and it might be better off as its own complete, separate framework (though perhaps one tightly partnered with Litmus) rather than bundled into it.

First -- Success or failure of an experiment may or may not indicate a problem. In some cases the failure of an experiment may be an expected result (such as in a stress-test situation); in other cases the success of an experiment may be an anomaly. You will not be able to replace human analysis of what's going on here, so I would recommend that any attempt at figuring out what matters and what doesn't be left to the humans owning the systems. You could try to build some ML around analyzing this, but that would likely need to stay very experimental for a very long time.

Second -- How is this different from monitoring? I've told the Litmus community before that monitoring needs to be a pillar of the design of Litmus. Would this have a custom Flask service, for instance, that lays out the "weaknesses" using some sort of heatmap? Report many more metrics to Prometheus? Export a lot of additional events, ship custom Grafana templates for organizations to consume into their monitoring, or something else? What about alerting? How would you configure an alert for a "severe" weakness (such as no AZ redundancy) versus a "minor" weakness (such as a readiness probe with a timeout that is too long)? Where do you even draw the line? This is as much a policy/legal question as an engineering one, and as you point out, it runs into the problem of noise.

Finally -- Here's what IS useful about this idea: instead of "an experiment failure means a weakness," build a resilience auditor. Something that dumps your Kubernetes manifests and determines whether they are "Litmus compliant," or even simply whether they have best practices implemented, and then generates a report based on that.
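On the Prometheus question above, a hedged sketch of what reporting weakness metrics with a severity label could look like, using the Python prometheus_client library; the metric and label names are hypothetical, not part of Litmus today:

```python
# Sketch: expose per-experiment weakness counts with a severity label so that
# alert rules can fire on "severe" findings while only recording "minor" ones.
# Metric and label names are hypothetical, not part of Litmus today.
from prometheus_client import Counter, start_http_server

WEAKNESSES = Counter(
    "litmus_weaknesses_total",
    "Weaknesses surfaced by chaos experiments",
    ["experiment", "severity"],
)

def record_weakness(experiment: str, severity: str) -> None:
    WEAKNESSES.labels(experiment=experiment, severity=severity).inc()

if __name__ == "__main__":
    start_http_server(9090)  # scrape target for Prometheus
    record_weakness("pod-delete", "severe")  # e.g. no AZ redundancy
    record_weakness("pod-delete", "minor")   # e.g. readiness probe timeout too long
```

An alert rule could then match only `severity="severe"`, which keeps the severe-versus-minor line a per-organization policy decision rather than something baked into Litmus.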
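And a minimal sketch of the resilience-auditor direction, assuming PyYAML and a couple of illustrative best-practice checks; a real auditor would need a far larger, configurable rule set:

```python
# Sketch: static "resilience audit" of Kubernetes Deployment manifests.
# The checks below are illustrative examples, not an official rule set.
import sys
import yaml  # PyYAML

def audit(manifest: dict) -> list[str]:
    findings = []
    if manifest.get("kind") != "Deployment":
        return findings
    if manifest["spec"].get("replicas", 1) < 2:
        findings.append("single replica: no redundancy")
    for container in manifest["spec"]["template"]["spec"].get("containers", []):
        name = container.get("name", "<unnamed>")
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readinessProbe")
        if not container.get("resources"):
            findings.append(f"{name}: no resource requests/limits")
    return findings

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for doc in yaml.safe_load_all(f):
            for finding in audit(doc or {}):
                print(finding)
```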
-
Thanks @cpitstick-argo ! A couple of thoughts that resonate right away:
The resilience auditor, i.e., a static resource resiliency/best-practices/health checker, is definitely an interesting one. Some ideas around flagging this via OpenPolicyAgent were discussed a while ago, though we haven't done more work in that direction yet. But catching these issues even before they make it to the cluster is very cool indeed.
-
The Litmus chaos hub displays the operators installed and the experiments run. When an experiment fails, it indicates (or should indicate) a weakness in the system. This can be counted at the experiment level so that the community can track various analytics.
For example:
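A minimal sketch of the per-experiment tally such an analytic could surface, assuming ChaosResult-style verdict records (field names are illustrative, not the actual Litmus schema):

```python
# Sketch: tally "weaknesses found" per experiment from verdict records.
from collections import Counter

results = [
    {"experiment": "pod-delete", "verdict": "Fail"},
    {"experiment": "pod-delete", "verdict": "Pass"},
    {"experiment": "node-drain", "verdict": "Fail"},
]

weaknesses_found = Counter(r["experiment"] for r in results if r["verdict"] == "Fail")
print(weaknesses_found)  # Counter({'pod-delete': 1, 'node-drain': 1})
```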