Making the github arc runner alarm more generic and scoped to runners #1371

Merged: ben851 merged 3 commits into main from arc-alarm-update on Jun 11, 2024

Conversation

@ben851 (Contributor) commented Jun 11, 2024

Summary | Résumé

The previous GitHub ARC runner alarm failed to trigger. The new alarm is more generic: it matches any log line containing ERROR instead of only "WRITE ERROR", and its filter is scoped to the ARC runner pods.
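The change described above can be sketched in Terraform roughly as follows. This is a reconstruction from the plan output further down, not the repository's actual source: the resource names, pattern, and attribute values are copied from the plan, while the `cloudwatch_enabled` and `sns_alert_critical_arn` variables are hypothetical stand-ins for however the real module sets `count` and wires up the SNS topic.

```hcl
# Hypothetical inputs; the real module may name or derive these differently.
variable "cloudwatch_enabled" {
  type    = bool
  default = true
}

variable "sns_alert_critical_arn" {
  type = string
}

# Counts log events from ARC runner pods whose log line contains "ERROR".
resource "aws_cloudwatch_log_metric_filter" "github-arc-runner-alarm" {
  count          = var.cloudwatch_enabled ? 1 : 0 # assumption: the plan shows index [0]
  name           = "GitHub ARC Runners Write Alarm"
  log_group_name = "/aws/containerinsights/notification-canada-ca-staging-eks-cluster/application"
  pattern        = "{ $.kubernetes.pod_name = \"github-arc-ss-staging-*-runner-*\"  && $.log = \"*ERROR*\" }"

  metric_transformation {
    name      = "aggregating-github-arc-runner-alarm"
    namespace = "LogMetrics"
    value     = "1"
  }
}

# Alarm on the metric above; values are as shown in the plan.
resource "aws_cloudwatch_metric_alarm" "github-arc-runner-error-alarm" {
  count               = var.cloudwatch_enabled ? 1 : 0
  alarm_name          = "github-arc-runner-error-alarm"
  alarm_description   = "GitHub ARC Runners Are Failing"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 1
  metric_name         = "aggregating-github-arc-runner-alarm"
  namespace           = "LogMetrics"
  period              = 300
  statistic           = "Sum"
  threshold           = 1
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_alert_critical_arn]
  ok_actions          = [var.sns_alert_critical_arn]
}
```

The braces in `pattern` mark it as a JSON filter pattern, so the match is applied to the `kubernetes.pod_name` and `log` fields of each structured log event rather than to the raw line, which is what scopes the metric to the runner pods.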

Related Issues | Cartes liées

Incident action

Test instructions | Instructions pour tester la modification

Confirm that `terraform apply` completes successfully.

Release Instructions | Instructions pour le déploiement

None.

Reviewer checklist | Liste de vérification du réviseur

  • This PR does not break existing functionality.
  • This PR does not violate GCNotify's privacy policies.
  • This PR does not raise new security concerns. Refer to our GC Notify Risk Register document on our Google drive.
  • This PR does not significantly alter performance.
  • Additional required documentation resulting from these changes is covered (such as the README, setup instructions, a related ADR or the technical documentation).

⚠ If boxes cannot be checked off before merging the PR, they should be moved to the "Release Instructions" section with appropriate steps required to verify before release. For example, changes to celery code may require tests on staging to verify that performance has not been affected.

@ben851 requested a review from a team June 11, 2024 17:34
@ben851 requested a review from jimleroyer as a code owner June 11, 2024 17:34

Updating alarms ⏰? Great! Please update the Google Sheet and add a 👍 to this message after 🙏

Staging: eks

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

⚠️   Warning: resources will be destroyed by this change!

Plan: 2 to add, 0 to change, 2 to destroy
Summary

CHANGE  NAME
add     aws_cloudwatch_log_metric_filter.github-arc-runner-alarm[0]
add     aws_cloudwatch_metric_alarm.github-arc-runner-error-alarm[0]
delete  aws_cloudwatch_log_metric_filter.github-arc-write-alarm[0]
delete  aws_cloudwatch_metric_alarm.github-arc-runner-write-alarm[0]
Show plan
Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # aws_cloudwatch_log_metric_filter.github-arc-runner-alarm[0] will be created
  + resource "aws_cloudwatch_log_metric_filter" "github-arc-runner-alarm" {
      + id             = (known after apply)
      + log_group_name = "/aws/containerinsights/notification-canada-ca-staging-eks-cluster/application"
      + name           = "GitHub ARC Runners Write Alarm"
      + pattern        = "{ $.kubernetes.pod_name = \"github-arc-ss-staging-*-runner-*\"  && $.log = \"*ERROR*\" }"

      + metric_transformation {
          + name      = "aggregating-github-arc-runner-alarm"
          + namespace = "LogMetrics"
          + unit      = "None"
          + value     = "1"
        }
    }

  # aws_cloudwatch_log_metric_filter.github-arc-write-alarm[0] will be destroyed
  # (because aws_cloudwatch_log_metric_filter.github-arc-write-alarm is not in configuration)
  - resource "aws_cloudwatch_log_metric_filter" "github-arc-write-alarm" {
      - id             = "GitHub ARC Runners Write Alarm" -> null
      - log_group_name = "/aws/containerinsights/notification-canada-ca-staging-eks-cluster/application" -> null
      - name           = "GitHub ARC Runners Write Alarm" -> null
      - pattern        = "WRITE ERROR" -> null

      - metric_transformation {
          - dimensions = {} -> null
          - name       = "aggregating-github-arc-write-alarm" -> null
          - namespace  = "LogMetrics" -> null
          - unit       = "None" -> null
          - value      = "1" -> null
        }
    }

  # aws_cloudwatch_metric_alarm.github-arc-runner-error-alarm[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "github-arc-runner-error-alarm" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-critical",
        ]
      + alarm_description                     = "GitHub ARC Runners Are Failing"
      + alarm_name                            = "github-arc-runner-error-alarm"
      + arn                                   = (known after apply)
      + comparison_operator                   = "LessThanThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 1
      + id                                    = (known after apply)
      + metric_name                           = "aggregating-github-arc-runner-alarm"
      + namespace                             = "LogMetrics"
      + ok_actions                            = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-critical",
        ]
      + period                                = 300
      + statistic                             = "Sum"
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"
    }

  # aws_cloudwatch_metric_alarm.github-arc-runner-write-alarm[0] will be destroyed
  # (because aws_cloudwatch_metric_alarm.github-arc-runner-write-alarm is not in configuration)
  - resource "aws_cloudwatch_metric_alarm" "github-arc-runner-write-alarm" {
      - actions_enabled           = true -> null
      - alarm_actions             = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-critical",
        ] -> null
      - alarm_description         = "GitHub ARC Runners Are Failing - Check Version Deprecation" -> null
      - alarm_name                = "github-arc-runner-write-alarm" -> null
      - arn                       = "arn:aws:cloudwatch:ca-central-1:239043911459:alarm:github-arc-runner-write-alarm" -> null
      - comparison_operator       = "LessThanThreshold" -> null
      - datapoints_to_alarm       = 0 -> null
      - dimensions                = {} -> null
      - evaluation_periods        = 1 -> null
      - id                        = "github-arc-runner-write-alarm" -> null
      - insufficient_data_actions = [] -> null
      - metric_name               = "aggregating-github-arc-write-alarm" -> null
      - namespace                 = "LogMetrics" -> null
      - ok_actions                = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-critical",
        ] -> null
      - period                    = 300 -> null
      - statistic                 = "Sum" -> null
      - tags                      = {} -> null
      - tags_all                  = {} -> null
      - threshold                 = 1 -> null
      - treat_missing_data        = "notBreaching" -> null
    }

Plan: 2 to add, 0 to change, 2 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: plan.tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "plan.tfplan"
Show Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_acm_certificate.client_vpn"]
WARN - plan.json - main - Missing Common Tags: ["aws_acm_certificate.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_acm_certificate.notification-canada-ca-alt[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_listener.internal_alb_tls"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_listener.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.internal_nginx_http"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-admin"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-api"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-document"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-document-api"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-documentation"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-application-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-cluster-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-prometheus-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-evicted-pods[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-pods-high-cpu-warning[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-pods-high-memory-warning[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-replicas-unavailable[0]"]
WARN - plan.json - main - Missing Common Tags:...

@ben851 merged commit 24eb090 into main Jun 11, 2024
3 checks passed
@ben851 deleted the arc-alarm-update branch June 11, 2024 19:05
This was referenced Jun 12, 2024