Defer to lazy client initialization to prevent dialing localhost #1078
base: main
Conversation
Thanks for taking a shot at this issue! The solution looks good and seems to follow the same pattern we applied to the Kubernetes alpha provider. I'll be testing this a bit to see if I can come up with an automated test scenario.
This looks good. We're going to hold off merging for a bit. Some wider-ranging credentials changes are in flight here: #1052. Once those merge, we can go ahead and merge this too. I think it's easier in that order since this is smaller and easier to rebase if it runs into conflicts.
Any movement on this? #1052 was merged a month ago.
Is this going to be merged soon? We're depending on this fix in order to actually utilize terraform for kubernetes deployments with GPU on GKE (the older provider version doesn't work with the taint "nvidia.com/gpu").
@alexsomesan +1 request to see this merged after @pfault can resolve the merge conflicts. This bug appears to surface in any workflow that leads to cluster recreation, so it is particularly painful for anyone iterating on development in this manner (per: #1028 (comment)).
Hey guys, we are facing this problem as well.
Please, for the love of terraform, merge this already.
@pfault Do you have time to fix the merge conflicts?
Happy to resolve the conflicts for this, just let me know how to collaborate. As a simple route, I can create a new PR by forking the PR branch. cc @flavioab
Force-pushed from e656c11 to b0fae1f
Rebased onto main, fixing the merge conflict.
@alexsomesan @flavioab can we get this merged?
Not to bump, but bump. Currently stuck on v1 due to the way this blocks migration and iteration.
What the other person above me said. I don't wanna bump, but uh... Bump. This thing needs to get merged.
Pinging @alexsomesan and @jrhouston 👋🏻
Please merge this. It has been driving me nuts on a daily basis.
I'm pretty sure there is a reason why it is still not merged, but...
When can we expect the merge?
I think the effect of this change is focused on the removal of one particular piece of the existing code.
The rest of the code shuffle looks like it wouldn't make a difference in behaviour. However, as noted in the comments there, that part was added as a workaround. It's likely that removing it will cause side-effects which are hard to predict right now. To move this conversation forward, we would need some solid reproduction configurations. If anyone is willing to contribute some that would demonstrate how this change makes an impact, we can move forward from there.
I'm afraid a solid reproduction configuration would not be easy, since this error only happens "sometimes"... Nonetheless, our configuration is copied from the HashiCorp example:
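Something along these lines — a sketch of that example's pattern with an illustrative resource name, since the commenter's exact snippet was not preserved:

```hcl
# Kubernetes provider fed directly from AKS resource attributes, as in the
# upstream example configuration.
provider "kubernetes" {
  host                   = azurerm_kubernetes_cluster.main.kube_config.0.host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.cluster_ca_certificate)
}
```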
The point is that the cluster is applied in a different root directory. We use AKS. What "helps" trigger this error is running the azurerm_kubernetes_cluster apply and the cluster setup from different machines.
Hey @alexsomesan @flavioab, any ETA here? Could this be merged any time soon, please?
Could we please get this merged sometime soon?
@apparentlymart @radeksimko sorry for bothering you, but you guys are awesome, and this renders the kubernetes provider unusable for multi-cluster setups when you change some cluster property (e.g. disk_size for a node). Maybe you can somehow influence this? 🥲
I don't have sufficient knowledge about Kubernetes or this provider to weigh in on when and how this could be merged, I'm afraid.

Seeing this does make me think a bit about the solution though: Terraform Core should be sending an unknown value to the provider in the situation where the configuration is derived from unknown values from elsewhere, which can in principle allow a provider to differentiate between totally unset vs. not yet known. However, I believe the old SDK hides that detail when you access the configuration through its API, making it seem like the unknown values are not set at all.

If backward compatibility is a concern for this change (I'm not sure) then it might be worth considering how the SDK could potentially expose information about whether particular arguments are known, and then have this special new behavior occur in the unknown case, leaving the unset case behaving however it currently does. (Sounds like it currently defaults to localhost in that case.)
bump |
Thank you for your input! Too bad the root cause seems to be a lot deeper. For anyone who stumbles upon this issue, please note that a configuration along the lines sketched below can, on its own, break the kubernetes provider right now at version 2.7.1.
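A minimal sketch of that kind of configuration (GKE names and values here are illustrative, not the original snippet):

```hcl
data "google_client_config" "default" {}

resource "google_container_cluster" "main" {
  name               = "example"
  location           = "europe-west1"
  initial_node_count = 1
}

# Provider credentials come straight from the managed cluster resource, so
# they are unknown whenever the cluster is going to be (re)created.
provider "kubernetes" {
  host                   = "https://${google_container_cluster.main.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(google_container_cluster.main.master_auth.0.cluster_ca_certificate)
}

resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}
```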
Just change a cluster parameter such as initial_node_count and apply again. At this point I don't think the kubernetes provider can be used without some extraordinary operational detours (e.g. separate management of cluster and k8s, or separate states and a remote state data source).
In my case, if I first run a targeted apply of the cluster resource, e.g. terraform apply -target=google_container_cluster.main (adapted to your example), and then a full apply, it works. Can you check if it works for you?
Yes, it does work.
Well, it's clearly a workaround, not a solution.
FWIW, in other cases where folks have employed that workaround it's typically only been needed for initial creation and not for subsequent updates, because the argument in question stays stable after the object has been created. Again, I don't know this particular situation well enough to say for certain, but I'd guess that the hostname of a cluster would stay constant for the life of that cluster, and so the extra run with -target would only be needed when the cluster is first created.
Well, actually, in the example above the first run isn't a problem at all; on the contrary, the subsequent runs are the problem, if you change any cluster parameter that triggers its recreation.
If we create outputs for the example configuration above, outputting the cluster endpoint and ca_cert (sketched below), this is what I get after initial creation and after an update:
First run:
Changed initial_node_count to a new value:
So the kubernetes cluster was re-created. I was also hoping to circumvent this by using …
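The outputs mentioned above might be defined like this (hypothetical names, assuming the GKE attributes from the earlier sketch; the actual output values were not preserved):

```hcl
output "endpoint" {
  value = google_container_cluster.main.endpoint
}

output "ca_cert" {
  value = google_container_cluster.main.master_auth.0.cluster_ca_certificate
}
```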
@apparentlymart As @yellowmegaman points out, creation is fine; it's anything that reconfigures the cluster that seems to cause the problem. I say "seems" because I'm not sure what black magic AWS pulls behind the curtains when you upgrade an EKS cluster, but it appears to trigger this dialing of localhost. We use EKS for our k8s clusters, and if I try to update the cluster version this problem appears. As a workaround I go into the EKS console in AWS and upgrade my cluster from there, then I go to terraform and change the EKS version number in our terraform code after the fact. If I do this I circumvent the problem in my use case. It's obviously not a scalable or convenient solution, but we have few enough EKS clusters that we can manage.
Just noticed that even if we split the CI/CD pipeline into "all resources" vs kubernetes-based lists of -target arguments, it won't be enough to create new infrastructure. It won't be IaC anymore, because of the number of manual actions required. Almost the same goes for the terraform_remote_state data source: you'd need separate repos to push infra and k8s workload configurations (in the correct order) to avoid conflicts, so adding a new cluster with terraform-managed k8s workloads also becomes a great hassle. At this point null_resource doesn't seem like a way out either. Running out of ideas on how to use this provider with any resources subject to change. @alexsomesan @radeksimko @dak1n1 @jrhouston do you have bandwidth to review/merge this PR, please?
We have swallowed the bitter pill of splitting the configuration into two deployments and then using data sources to read the modified clusters, effectively decoupling them (one terraform run for creating/updating the clusters and one for configuring the clusters with this provider), as sketched below.
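A sketch of the second deployment's side of that split, assuming AKS (names are illustrative, not the commenter's actual code):

```hcl
# Read the already-created cluster instead of managing it in this deployment.
data "azurerm_kubernetes_cluster" "main" {
  name                = "example-aks"
  resource_group_name = "example-rg"
}

provider "kubernetes" {
  host                   = data.azurerm_kubernetes_cluster.main.kube_config.0.host
  client_certificate     = base64decode(data.azurerm_kubernetes_cluster.main.kube_config.0.client_certificate)
  client_key             = base64decode(data.azurerm_kubernetes_cluster.main.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.main.kube_config.0.cluster_ca_certificate)
}
```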
@DanielHabenicht we use the strategy described here and at the moment it works well, so we have no reason to split as you have done. However, we are worried that we are missing something that will trap us later. I assume you tried the pattern above, so, if you don't mind, what exactly was it that you couldn't achieve with that pattern that caused you to split into two deployments? Thanks in advance!
It was just a hassle to manually set all the targets (as Kubernetes is not the only thing we deploy), so we opted for a clear split.
Any update here?
```diff
@@ -226,7 +226,7 @@ type KubeClientsets interface {
 }

 type kubeClientsets struct {
-	config *restclient.Config
+	config *clientcmd.ClientConfig
```
This should not be a pointer, because ClientConfig is an interface: an interface value can already hold a pointer to a concrete type, so a pointer to an interface just adds a needless level of indirection.
I've been debugging this and noticed that for me, the provider logs an error while loading its configuration. You can only see this with debug logging enabled. Apparently the provider silently recovers from it (basically ignoring the error) by using a default configuration, which connects to localhost. Further investigation revealed that the cause was a setting on my end. But I think this PR intended to address a different root cause.
This is a very important bug that needs to be addressed. Is there any update on this issue?
Just hit the bug and lost plenty of time to find the cause :(
Same @danielllek
I hit the bug with TF 1.0, but it seems to be gone in v1.2.x.
In tf v1.2.x, it seems like the datasource refresh now happens before the provider configuration, so the data is now available. I looked at the code in depth, and this PR would not help, because the provider is initialized by calling the ConfigureContextFunc anyway, so delaying the config does not make any difference if the datasource data is not there.
This is maybe a quirk of the module in question, but this issue is still being reported with v1.3.9 and the latest providers: terraform-aws-modules/terraform-aws-eks#2501
See my reply on the issue: the example failing code does not use a datasource to configure the provider, so it may indeed still fail. Using the data source lookup pattern, it should work with TF 1.2.
@streamnsight, I create the Kubernetes cluster in the same module and then access it and provision kubernetes manifests in it. Can you help me understand how I would change this to use the data source pattern?
@chamini2 just follow the examples, along the lines of the sketch below.
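A sketch of that data source lookup pattern for EKS (the cluster name is illustrative):

```hcl
data "aws_eks_cluster" "this" {
  name = "my-cluster"
}

# Short-lived auth token for the cluster, fetched at refresh time.
data "aws_eks_cluster_auth" "this" {
  name = "my-cluster"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}
```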
@streamnsight this solution won't work if the cluster doesn't already exist in AWS.
The only difference I see from my solution is the explicit …
I will add that and see if I don't get any more errors.
Description
Defers initialization of the actual client until first use, to prevent dialing http://localhost when credentials are configured late.
Fixes cases where parameters like config_path that are only generated during apply lead to the use of an empty default config, as illustrated below.
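A hypothetical configuration exhibiting that failure mode (these resources are illustrative, not taken from this PR):

```hcl
# The kubeconfig file only comes into existence during apply...
resource "local_file" "kubeconfig" {
  content  = azurerm_kubernetes_cluster.main.kube_config_raw
  filename = "${path.module}/kubeconfig"
}

# ...so config_path is unknown when the provider is configured, and the
# provider falls back to an empty default config that dials localhost.
provider "kubernetes" {
  config_path = local_file.kubeconfig.filename
}
```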
Acceptance tests
Output from acceptance testing:
Release Note
Release note for CHANGELOG:
References
#758
#1028
Community Note