
sysctl change does not actually get applied #18208

Closed

Comments

@christhegrand

christhegrand commented May 21, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.8.2
on darwin_arm64

  • provider registry.terraform.io/hashicorp/google v5.22.0
  • provider registry.terraform.io/hashicorp/google-beta v5.27.0

Your version of Terraform is out of date! The latest version
is 1.8.3. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

google_container_cluster

Terraform Configuration

# See https://www.terraform.io/docs/providers/google/r/container_cluster.html
resource "google_container_cluster" "primary" {
  # https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_versions#using-the-google-beta-provider
  provider       = google-beta
  name           = "${var.cluster-name}-cluster"
  location       = var.location
  node_locations = var.node-locations

  timeouts {
    read   = "60m"
    create = "120m"
    delete = "120m"
    update = "120m"
  }

  // I think we should enable this everywhere, but at the moment I put this behind
  // a variable because some clusters have it enabled and some don't, and I'm just
  // trying to perform a TF upgrade here.
  enable_shielded_nodes = var.shielded-nodes-enabled

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Make the cluster VPC-native instead of routes-based
  # See https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
  ip_allocation_policy {
  }

  # This, confusingly enough, is actually the configuration for auto-provisioned
  # node pools. We want this for the ASR pods that use a GPU.
  # See documentation in
  # https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning,
  # https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler,
  # and https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler.
  cluster_autoscaling {
    enabled = true

    autoscaling_profile = var.optimize-utilization ? "OPTIMIZE_UTILIZATION" : "BALANCED"

    // The cluster autoscaler can create nodes for use by GPU-enabled pods, and
    // pods that do not need GPUs. Be very careful adding any changes here that
    // should affect only one of those groups of pods.
    dynamic "resource_limits" {
      for_each = var.cluster-resource-limits
      content {
        resource_type = resource_limits.value["resource_type"]
        minimum       = resource_limits.value["minimum"]
        maximum       = resource_limits.value["maximum"]
      }
    }

    auto_provisioning_defaults {
      oauth_scopes = [
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/monitoring",
        "https://www.googleapis.com/auth/devstorage.read_only",
        "https://www.googleapis.com/auth/compute"
      ]

      management {
        auto_repair  = true
        auto_upgrade = true
      }

      disk_size  = var.disk-size-gb
      image_type = "COS_CONTAINERD"
    }
  }

  # Setting an empty username and password explicitly disables basic auth
  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  node_pool_defaults {
    node_config_defaults {
      gcfs_config {
        enabled = var.is-image-streaming-enabled
      }
    }
  }

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
}

Debug Output

2024-05-21T13:17:04.915-0700 [WARN]  Provider "provider[\"registry.terraform.io/hashicorp/google-beta\"]" produced an unexpected new value for module.cluster.google_container_cluster.primary, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].linux_node_config: block count changed from 1 to 0

Expected Behavior

When I run terraform apply with this change, Terraform reports that the change has been applied successfully. When I run terraform apply again, I should see that there are no more changes to apply.

Actual Behavior

When I run terraform apply with this change, Terraform reports that the change has been applied successfully. But when I run terraform apply again, I see the sysctl change show up as an unapplied change. Basically, Terraform thinks the change has been applied successfully, but the change does not actually seem to persist.

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

No response

b/342657392

b/343052499

@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels May 21, 2024
@ggtisc
Collaborator

ggtisc commented May 23, 2024

Hi @christhegrand

As I understand it, your issue is that after creating the resource "google_container_cluster" "primary", the following property wasn't applied:

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }

And the only step to reproduce this issue is to run terraform apply to create this resource. Please confirm if that is right.

Finally, please send us the complete debug log output and the resulting terraform.tfstate after the resource creation, to confirm what is happening in your environment.

@ggtisc ggtisc self-assigned this May 23, 2024
@christhegrand
Author

christhegrand commented May 23, 2024

As I understand it, your issue is that after creating the resource "google_container_cluster" "primary", the following property wasn't applied:

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }

And the only step to reproduce this issue is to run terraform apply to create this resource. Please confirm if that is right.

Yes, that is correct.

@christhegrand
Author

christhegrand commented May 23, 2024

I added a GitHub gist with files showing the output of terraform apply with debug logging turned on, together with the Terraform state.

https://gist.github.com/christhegrand/8e1bf8de6e842298a762ca71c4ae1462

@trodge trodge removed the forward/review In review; remove label to forward label May 24, 2024
@ggtisc
Collaborator

ggtisc commented May 27, 2024

Possible bug in update: it looks like node_config only has one updatable field, and linux_node_config has not been implemented for updates.

@ggtisc ggtisc removed their assignment May 27, 2024
@lornaluo

lornaluo commented Jun 6, 2024

Instead of using node_config in the google_container_cluster resource:

resource "google_container_cluster" "primary" {
  ... 
  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
 ...
}

you can try setting node_config in the google_container_node_pool resource to configure the Linux nodes:

# Separately managed node pool
resource "google_container_node_pool" "primary_nodes" {
  ...
  node_config {
    ...
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
  ...
}

I tried applying these settings with Terraform and it works.
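
For anyone landing here, a fuller sketch of this approach (the resource name, machine_type, node_count, and the variable references are illustrative placeholders, not taken from the reporter's configuration):

```hcl
# Hypothetical separately managed node pool; values are placeholders.
resource "google_container_node_pool" "primary_nodes" {
  name     = "primary-node-pool"
  cluster  = google_container_cluster.primary.id
  location = var.location

  node_count = 1

  node_config {
    machine_type = "e2-medium"

    # sysctls set here, on the node pool resource, are applied and
    # persist across plans, unlike the cluster-level node_config.
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
}
```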

@lornaluo

lornaluo commented Jun 7, 2024

Added some references for this.

In https://registry.terraform.io/providers/hashicorp/google/5.32.0/docs/resources/container_cluster#argument-reference, it mentions that google_container_cluster.node_config is not recommended with Terraform:

[screenshot of the documentation note]

It is recommended to create the cluster with a separately managed node pool instead.

@christhegrand
Author

Thanks. We are trying to add node config defaults to autoprovisioned node pools. Is it not possible to do that using Terraform?

@lornaluo

lornaluo commented Jun 8, 2024

Maybe you can first enable node auto-provisioning on the cluster with Terraform, then specify which node pools are auto-provisioned:

https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#mark_node_auto-provisioned
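
For reference, the linked guide marks an existing node pool as auto-provisioned with gcloud rather than Terraform; a sketch, where NODE_POOL_NAME and CLUSTER_NAME are placeholders:

```shell
# Mark an existing node pool as auto-provisioned (placeholder names).
gcloud container node-pools update NODE_POOL_NAME \
  --cluster CLUSTER_NAME \
  --enable-autoprovisioning
```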

@mauriciopoppe

As mentioned in terraform-google-modules/terraform-google-kubernetes-engine#15 the default nodepool causes trouble with managing the cluster. Personally I always delete the default nodepool and create additional nodepools and manage their lifecycle outside the lifecycle of the cluster resource.

@christhegrand can you update your cluster provisioning script to create additional nodepool resources and delete the default nodepool?

@christhegrand
Author

I did delete the default node pool:

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true

Maybe you can first enable node auto-provisioning on the cluster with Terraform, then specify which node pools are auto-provisioned:

https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#mark_node_auto-provisioned

That's interesting. I didn't realize I could create a node pool and then enable auto-provisioning on it.

@wangzhen127

@christhegrand Are you able to make changes to the additional node pools (not using the cluster level node_config field)?

@lornaluo

lornaluo commented Jul 1, 2024

@christhegrand I have created a bug internally for GKE to deprecate the cluster-level field node_config.
Could we close this one if there are no other issues?

@christhegrand
Author

Sure. Thank you!

@wyardley

wyardley commented Oct 21, 2024

If I'm not mistaken, this may get fixed via the fix for #19225 (GoogleCloudPlatform/magic-modules#12014)

Note that if you define the node pool via node_pool {} or separately as google_container_node_pool, updates should already function correctly. If you taint / recreate the node pool, it should also update. But the fix above should cover the update-in-place path as well.

@lornaluo I think node_config for the default node pool is necessary for the case where remove_default_node_pool is set to false; while it's discouraged to use the implicit default node pool creation, guessing it's used widely enough that this functionality is unlikely to get removed entirely?

@christhegrand
Author

I tried applying this sysctl config with version 6.12.0 of the Google provider, and it's still not working. I see the same behavior as before: I can run terraform apply without any errors, but when I run it a second time, or check the Terraform plan after running it, the sysctl changes are still among the changes to be applied.

@wyardley

@christhegrand Just want to confirm, after rereading the comments above: is this on a cluster with the default pool removed and node pool auto-provisioning, or are you seeing the issue on the default node pool?

For the auto-provisioned node pools, I think you may need node_pool_auto_config instead?

@christhegrand
Author

It's for an autoprovisioned node pool with the default pool removed. Let me try node_pool_auto_config.

@christhegrand
Author

Hm, how would I specify sysctl information in that block? I don't see a way to do that in the documentation: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#nested_node_pool_auto_config

@wyardley

Is it possible to configure Linux sysctls on a cluster with node pool auto-provisioning via the console / CLI? Looking at the API docs, I think that setting is output-only, and I'm not seeing it available in node_config_defaults either from a quick look.

Either way, I could be wrong, but as I think was said above, node_config should not be set at all if you're also setting the option to remove the default node pool.
