
sysctl change does not actually get applied #18208

Closed

Comments

@christhegrand

christhegrand commented May 21, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.8.2
on darwin_arm64

  • provider registry.terraform.io/hashicorp/google v5.22.0
  • provider registry.terraform.io/hashicorp/google-beta v5.27.0

Your version of Terraform is out of date! The latest version
is 1.8.3. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

google_container_cluster

Terraform Configuration

# See https://www.terraform.io/docs/providers/google/r/container_cluster.html
resource "google_container_cluster" "primary" {
  # https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_versions#using-the-google-beta-provider
  provider       = google-beta
  name           = "${var.cluster-name}-cluster"
  location       = var.location
  node_locations = var.node-locations

  timeouts {
    read   = "60m"
    create = "120m"
    delete = "120m"
    update = "120m"
  }

  // I think we should enable this everywhere, but at the moment I put this behind
  // a variable because some clusters have it enabled and some don't, and I'm just
  // trying to perform a TF upgrade here.
  enable_shielded_nodes = var.shielded-nodes-enabled

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Make the cluster VPC-native instead of routes-based
  # See https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
  ip_allocation_policy {
  }

  # This, confusingly enough, is actually the configuration for auto-provisioned
  # node pools. We want this for the ASR pods that use a GPU.
  # See documentation in
  # https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning,
  # https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler,
  # and https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler.
  cluster_autoscaling {
    enabled = true

    autoscaling_profile = var.optimize-utilization ? "OPTIMIZE_UTILIZATION" : "BALANCED"

    // The cluster autoscaler can create nodes for use by GPU-enabled pods, and
    // pods that do not need GPUs. Be very careful adding any changes here that
    // should affect only one of those groups of pods.
    dynamic "resource_limits" {
      for_each = var.cluster-resource-limits
      content {
        resource_type = resource_limits.value["resource_type"]
        minimum       = resource_limits.value["minimum"]
        maximum       = resource_limits.value["maximum"]
      }
    }

    auto_provisioning_defaults {
      oauth_scopes = [
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/monitoring",
        "https://www.googleapis.com/auth/devstorage.read_only",
        "https://www.googleapis.com/auth/compute"
      ]

      management {
        auto_repair  = true
        auto_upgrade = true
      }

      disk_size  = var.disk-size-gb
      image_type = "COS_CONTAINERD"
    }
  }

  # Setting an empty username and password explicitly disables basic auth
  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  node_pool_defaults {
    node_config_defaults {
      gcfs_config {
        enabled = var.is-image-streaming-enabled
      }
    }
  }

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
}

Debug Output

2024-05-21T13:17:04.915-0700 [WARN]  Provider "provider[\"registry.terraform.io/hashicorp/google-beta\"]" produced an unexpected new value for module.cluster.google_container_cluster.primary, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].linux_node_config: block count changed from 1 to 0

Expected Behavior

When I run terraform apply with this change, Terraform reports that the change has been applied successfully. When I run terraform apply again, I should see that there are no more changes to apply.

Actual Behavior

When I run terraform apply with this change, Terraform reports that the change has been applied successfully. But when I run terraform apply again, I see the sysctl change show up as an unapplied change. Basically, Terraform thinks the change has been applied successfully, but the change does not actually seem to persist.

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

No response

b/342657392

b/343052499

@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels May 21, 2024
@ggtisc
Collaborator

ggtisc commented May 23, 2024

Hi @christhegrand

As I understand it, your issue is that after creating the resource "google_container_cluster" "primary", the following property wasn't applied:

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }

And the only step to reproduce this issue is to run terraform apply to create this resource. Please confirm if that is right.

Finally, please send us the complete debug log output and the resulting terraform.tfstate after the resource creation, to confirm what is happening in your environment.

@ggtisc ggtisc self-assigned this May 23, 2024
@christhegrand
Author

christhegrand commented May 23, 2024

As I understand it, your issue is that after creating the resource "google_container_cluster" "primary", the following property wasn't applied:

  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }

And the only step to reproduce this issue is to run terraform apply to create this resource. Please confirm if that is right.

Yes, that is correct.

@christhegrand
Author

christhegrand commented May 23, 2024

I added a GitHub gist with files showing the output of terraform apply with debug logging turned on, together with the Terraform state.

https://gist.github.com/christhegrand/8e1bf8de6e842298a762ca71c4ae1462

@trodge trodge removed the forward/review In review; remove label to forward label May 24, 2024
@ggtisc
Collaborator

ggtisc commented May 27, 2024

Possible bug in update: it looks like node_config only has one updatable field, and linux_node_config has not been implemented for updates.

@ggtisc ggtisc removed their assignment May 27, 2024
@lornaluo

lornaluo commented Jun 6, 2024

Instead of using node_config in the google_container_cluster resource:

resource "google_container_cluster" "primary" {
  ... 
  node_config {
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
 ...
}

you can try setting node_config in the google_container_node_pool resource to configure the Linux nodes:

# Separately managed node pool
resource "google_container_node_pool" "primary_nodes" {
  ...
  node_config {
    ...
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
  ...
}

I tried applying these settings with Terraform and it works.
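
For anyone landing here, a fuller sketch of this approach (the resource name, machine_type, node_count, and the variable references are illustrative placeholders, not taken from the reporter's configuration):

```hcl
# Hypothetical separately managed node pool; values are placeholders.
resource "google_container_node_pool" "primary_nodes" {
  name     = "primary-node-pool"
  cluster  = google_container_cluster.primary.id
  location = var.location

  node_count = 1

  node_config {
    machine_type = "e2-medium"

    # sysctls set here, on the node pool resource, are applied and
    # persist across plans, unlike the cluster-level node_config.
    linux_node_config {
      sysctls = {
        "net.core.somaxconn" = "4096"
      }
    }
  }
}
```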

@lornaluo

lornaluo commented Jun 7, 2024

Added some references for this.

In https://registry.terraform.io/providers/hashicorp/google/5.32.0/docs/resources/container_cluster#argument-reference, it mentions that google_container_cluster.node_config is not recommended with Terraform:

[screenshot of the documentation note]

It is recommended to create the cluster with a separately managed node pool instead.

@christhegrand
Author

Thanks. We are trying to add node config defaults to autoprovisioned node pools. Is it not possible to do that using Terraform?

@lornaluo

lornaluo commented Jun 8, 2024

Maybe you can first enable node auto-provisioning on the cluster with Terraform, then specify which node pools are auto-provisioned:

https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#mark_node_auto-provisioned
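
For reference, the linked guide marks an existing node pool as auto-provisioned with gcloud rather than Terraform; a sketch, where NODE_POOL_NAME and CLUSTER_NAME are placeholders:

```shell
# Mark an existing node pool as auto-provisioned (placeholder names).
gcloud container node-pools update NODE_POOL_NAME \
  --cluster CLUSTER_NAME \
  --enable-autoprovisioning
```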

@mauriciopoppe

As mentioned in terraform-google-modules/terraform-google-kubernetes-engine#15 the default nodepool causes trouble with managing the cluster. Personally I always delete the default nodepool and create additional nodepools and manage their lifecycle outside the lifecycle of the cluster resource.

@christhegrand can you update your cluster provisioning script to create additional nodepool resources and delete the default nodepool?

@christhegrand
Author

I did delete the default node pool:

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true

Maybe you can first enable node auto-provisioning on the cluster with Terraform, then specify which node pools are auto-provisioned:

https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#mark_node_auto-provisioned

That's interesting. I didn't realize I could create a node pool and then enable auto-provisioning on it.

@wangzhen127

@christhegrand Are you able to make changes to the additional node pools (not using the cluster level node_config field)?

@lornaluo

lornaluo commented Jul 1, 2024

@christhegrand I have created a bug internally for GKE to deprecate the cluster-level field node_config.
Could we close this one if there are no other issues?

@christhegrand
Author

Sure. Thank you!

@wyardley

wyardley commented Oct 21, 2024

If I'm not mistaken, this may get fixed via the fix for #19225 (GoogleCloudPlatform/magic-modules#12014)

Note that if you define the node pool via node_pool {} or separately as google_container_node_pool, updates should already function correctly. If you taint / recreate the node pool, it should also update. But the fix above should cover the update-in-place path as well.

@lornaluo I think node_config for the default node pool is necessary for the case where remove_default_node_pool is set to false; while it's discouraged to use the implicit default node pool creation, guessing it's used widely enough that this functionality is unlikely to get removed entirely?

@christhegrand
Author

I tried applying this sysctl config with version 6.12.0 of the Google provider, and it's still not working. I see the same behavior as before: I can run terraform apply without any errors, but when I run it a second time, or check the Terraform plan after running it, the sysctl changes are still among the changes to be applied.

@wyardley

@christhegrand Just want to confirm, after rereading the comments above: is this on a cluster with the default pool removed and node pool auto-provisioning, or are you seeing the issue on the default node pool?

For the auto-provisioned node pools, I think you may need node_pool_auto_config instead?

@christhegrand
Author

It's for an autoprovisioned node pool with the default pool removed. Let me try node_pool_auto_config.

@christhegrand
Author

Hm, how would I specify sysctl information in that block? I don't see a way to do that in the documentation: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#nested_node_pool_auto_config

@wyardley

Is it possible to configure Linux sysctls on a cluster with node pool auto-provisioning via the console / CLI? Looking at the API docs, I think that setting is output-only, and I'm not seeing it available in node_config_defaults either from a quick look.

Either way, I could be wrong, but as I think was said above, node_config should not be set at all if you're also setting the option to remove the default node pool.
