Add "weaknesses found" analytic to chaos hub #4517
Replies: 3 comments
-
Experiments can also fail because of configuration issues in the experiment itself. We need to filter those out from being counted as finding a weakness.
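A minimal sketch of that filtering idea, assuming ChaosResult-style records where the failure cause is available (the `failure_reason` field and its values here are hypothetical, not an actual Litmus field):

```python
# Sketch: count a failed run as a weakness only when the failure was not
# caused by the experiment's own configuration.
# `failure_reason` and its values are hypothetical, not a real Litmus field.

CONFIG_FAILURE_REASONS = {"invalid-experiment-spec", "missing-rbac", "image-pull-error"}

def is_weakness(result: dict) -> bool:
    if result.get("verdict") != "Fail":
        return False
    return result.get("failure_reason") not in CONFIG_FAILURE_REASONS

results = [
    {"experiment": "pod-delete", "verdict": "Fail", "failure_reason": "probe-failed"},
    {"experiment": "pod-delete", "verdict": "Fail", "failure_reason": "missing-rbac"},
]
print(sum(is_weakness(r) for r in results))  # -> 1; the misconfigured run is excluded
```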
-
This is a huge, very complex thought, and it might be better off as its own complete, separate framework (though perhaps one tightly partnered with Litmus) rather than bundled into it.

First -- Success or failure of an experiment may or may not indicate a problem. In some cases the failure of an experiment may be an expected result (such as in a stress-test situation); in other cases the success of an experiment may be an anomaly. You will not be able to replace human analysis of what's going on here, so I would recommend that any attempt at figuring out what matters and what doesn't be left to the humans owning the systems. You could try to build some ML around analyzing this, but that would likely need to stay very experimental for a very long time.

Second -- How is this different from monitoring? I've told the Litmus community before that monitoring needs to be a pillar of the design of Litmus. Would this have a custom Flask service, for instance, that lays out the "weaknesses" using some sort of heatmap? Report many more metrics to Prometheus? Export a lot of additional events, ship custom Grafana templates for organizations to consume into their monitoring, or something else? What about alerting? How would you configure an alert for a "severe" weakness (such as no AZ redundancy) versus a "minor" weakness (such as a readiness probe with a timeout that is too long)? Where do you even draw the line? This is as much a policy/legal question as an engineering one, and as you point out, it runs into the problem of noise.

Finally -- Here's what IS useful about this idea: instead of "an experiment failure means a weakness," build a resilience auditor. Something that dumps your Kubernetes manifests and determines whether they are "Litmus compliant," or even simply whether they have best practices implemented, and then generates a report based on that.
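On the Prometheus question above, a hedged sketch of what reporting weakness metrics with a severity label could look like, using the Python prometheus_client library; the metric and label names are hypothetical, not part of Litmus today:

```python
# Sketch: expose per-experiment weakness counts with a severity label so that
# alert rules can fire on "severe" findings while only recording "minor" ones.
# Metric and label names are hypothetical, not part of Litmus today.
from prometheus_client import Counter, start_http_server

WEAKNESSES = Counter(
    "litmus_weaknesses_total",
    "Weaknesses surfaced by chaos experiments",
    ["experiment", "severity"],
)

def record_weakness(experiment: str, severity: str) -> None:
    WEAKNESSES.labels(experiment=experiment, severity=severity).inc()

if __name__ == "__main__":
    start_http_server(9090)  # scrape target for Prometheus
    record_weakness("pod-delete", "severe")  # e.g. no AZ redundancy
    record_weakness("pod-delete", "minor")   # e.g. readiness probe timeout too long
```

An alert rule could then match only `severity="severe"`, which keeps the severe-versus-minor line a per-organization policy decision rather than something baked into Litmus.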
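And a minimal sketch of the resilience-auditor direction, assuming PyYAML and a couple of illustrative best-practice checks; a real auditor would need a far larger, configurable rule set:

```python
# Sketch: static "resilience audit" of Kubernetes Deployment manifests.
# The checks below are illustrative examples, not an official rule set.
import sys
import yaml  # PyYAML

def audit(manifest: dict) -> list[str]:
    findings = []
    if manifest.get("kind") != "Deployment":
        return findings
    if manifest["spec"].get("replicas", 1) < 2:
        findings.append("single replica: no redundancy")
    for container in manifest["spec"]["template"]["spec"].get("containers", []):
        name = container.get("name", "<unnamed>")
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readinessProbe")
        if not container.get("resources"):
            findings.append(f"{name}: no resource requests/limits")
    return findings

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for doc in yaml.safe_load_all(f):
            for finding in audit(doc or {}):
                print(finding)
```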
-
Thanks @cpitstick-argo ! A couple of thoughts that resonate right away:
The resilience auditor, i.e., a static resource resiliency/best-practices/health checker, is definitely an interesting one. Some ideas around flagging this via OpenPolicyAgent were discussed a while ago, though we haven't done more work in that direction yet. But catching these issues even before they make it to the cluster is very cool indeed.
-
The Litmus chaos hub displays the operators installed and the experiments run. When an experiment fails, it indicates (or should indicate) a weakness in the system. This can be counted at the experiment level so that the community can track various analytics.
For example:
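A minimal sketch of the per-experiment tally such an analytic could surface, assuming ChaosResult-style verdict records (field names are illustrative, not the actual Litmus schema):

```python
# Sketch: tally "weaknesses found" per experiment from verdict records.
from collections import Counter

results = [
    {"experiment": "pod-delete", "verdict": "Fail"},
    {"experiment": "pod-delete", "verdict": "Pass"},
    {"experiment": "node-drain", "verdict": "Fail"},
]

weaknesses_found = Counter(r["experiment"] for r in results if r["verdict"] == "Fail")
print(weaknesses_found)  # Counter({'pod-delete': 1, 'node-drain': 1})
```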