
ManagedNodeGroup is assigned to incorrect security group by default #1275

Open
JustASquid opened this issue Jul 25, 2024 · 4 comments
Labels: impact/usability, kind/bug

Comments

@JustASquid (Contributor)

What happened?

I noticed an interesting issue when spinning up a ManagedNodeGroup. Nodes in the group were not able to reach the DNS server or otherwise connect to nodes in the cluster's default node group.

The problem is that the default NG (created by EKS, not Pulumi) is assigned to a different security group than subsequent MNGs. Indeed, if we take a look at the cluster in EKS, we can see that two security groups are associated with it. One of them has the following description:

EKS created security group applied to ENI that is attached to EKS Control Plane master nodes, as well as any managed workloads

And the other is simply

Managed by Pulumi.

The default NG is assigned to the EKS-created SG, while any subsequent MNGs are assigned to the Pulumi-created one. This means that MNG nodes cannot communicate with core cluster resources.
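
For reference, a minimal sketch (not part of the original report) of how one can surface both security group IDs from the Pulumi program to compare them against what each node group actually uses; the property paths mirror the ones used later in this thread, and the resource name is illustrative:

import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("cluster", { skipDefaultNodeGroup: false });

// Security group created by EKS itself ("EKS created security group applied to ENI ...").
export const eksCreatedSecurityGroupId = cluster.eksCluster.vpcConfig.clusterSecurityGroupId;

// Node security group created by the Pulumi EKS component ("Managed by Pulumi.").
export const pulumiCreatedSecurityGroupId = cluster.nodeSecurityGroup.id;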

It may be possible to work around the issue by skipping the default node group creation and having all nodes handled by MNGs; however, I haven't tested this yet.

Example

N/A

Output of pulumi about

CLI
Version 3.120.0
Go Version go1.22.4
Go Compiler gc

Plugins
KIND NAME VERSION
resource aws 6.41.0
resource awsx 2.12.3
resource docker 4.5.4
resource docker 3.6.1
resource eks 2.7.1
resource kubernetes 4.13.1
language nodejs unknown

Host
OS ubuntu
Version 22.04
Arch x86_64

This project is written in nodejs: executable='/home/daniel/.nvm/versions/node/v21.6.1/bin/node' version='v21.6.1'

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@JustASquid added the kind/bug and needs-triage labels Jul 25, 2024
@t0yv0 (Member) commented Jul 26, 2024

Thank you for filing this issue @JustASquid. Could you add a quick repro program to make it easier to reproduce this exact issue? Thank you.

@t0yv0 added the needs-repro label and removed the needs-triage label Jul 26, 2024
@rquitales (Member)

@JustASquid, thank you for reporting this issue and providing detailed information. I am able to repro the problem using the example in our repository (Managed NodeGroups Example) with skipDefaultNodeGroup set to false. I confirmed the issue by deploying an nginx workload pinned to the default node group and a curlimage workload to the managed node group; the curlimage workload couldn't reach the nginx pod unless the workloads were swapped between node groups.

You've correctly identified that the two different security groups are causing communication issues. However, it's actually the managed node group that's using the security group created by EKS, while the default node group uses the security group managed by Pulumi. When the EKS provider is set to not skip the default node group creation, we create a security group that only allows intra-node communication within that group. Reference: Security Group Configuration.

The EKS managed node group uses the default cluster security group created by AWS. Even if an additional security group is specified during cluster creation (which is used by the default node group), it won't be attached to the managed node group instances. To enable communication between these node groups, you need to use a custom launch template for the ManagedNodeGroup to specify the security group created by Pulumi.

Here’s a TypeScript example of setting this up:

import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";
import * as pulumi from "@pulumi/pulumi";

const cluster = new eks.Cluster("cluster", {
  // ... (other configurations)
  skipDefaultNodeGroup: false,
});

// Create a Managed Node Group with a custom launch template so that it uses
// the same security group as the default node group.

// Builds the MIME multi-part user data that runs the EKS bootstrap script so
// that instances launched from the custom launch template can join the cluster.
function createUserData(cluster: aws.eks.Cluster, extraArgs: string): pulumi.Output<string> {
  const userdata = pulumi
    .all([
      cluster.name,
      cluster.endpoint,
      cluster.certificateAuthority.data,
    ])
    .apply(([clusterName, clusterEndpoint, clusterCertAuthority]) => {
      return `MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash

/etc/eks/bootstrap.sh --apiserver-endpoint "${clusterEndpoint}" --b64-cluster-ca "${clusterCertAuthority}" "${clusterName}" ${extraArgs}
--==MYBOUNDARY==--`;
    });

  // Encode the user data as base64, as required by the launch template API.
  return pulumi
    .output(userdata)
    .apply((ud) => Buffer.from(ud, "utf-8").toString("base64"));
}

const lt = new aws.ec2.LaunchTemplate("my-mng-lt", {
  imageId: "ami-0cfd96d646e5535a8",
  instanceType: "t3.medium",
  vpcSecurityGroupIds: [cluster.defaultNodeGroup!.nodeSecurityGroup.id], // <- This is where we define the SG to be used by the MNG.
  userData: createUserData(cluster.core.cluster, "--kubelet-extra-args --node-labels=mylabel=myvalue"), // This is required to enable instances to join the cluster.
});

const mng = new eks.ManagedNodeGroup("cluster-my-mng", {
  // ... (other configurations)
  cluster: cluster,
  launchTemplate: {
    id: lt.id,
    version: pulumi.interpolate`${lt.latestVersion}`,
  },
});

Alternatively, as you mentioned, you could skip creating the default node group, and everything should work as expected. Please let us know if this resolves your issue or if you need further assistance!
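
For completeness, here is a minimal sketch of that alternative with skipDefaultNodeGroup set to true, so every node runs in a managed node group and shares the EKS-created cluster security group. The IAM role setup, instance type, and scaling numbers are illustrative assumptions, not taken from this thread:

import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// IAM role assumed by the managed node group instances (illustrative).
const nodeRole = new aws.iam.Role("mng-node-role", {
  assumeRolePolicy: JSON.stringify({
    Version: "2012-10-17",
    Statement: [{
      Action: "sts:AssumeRole",
      Effect: "Allow",
      Principal: { Service: "ec2.amazonaws.com" },
    }],
  }),
});

// Standard managed policies for EKS worker nodes.
[
  "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
  "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
  "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
].forEach((policyArn, i) => new aws.iam.RolePolicyAttachment(`mng-node-role-policy-${i}`, {
  role: nodeRole.name,
  policyArn,
}));

// No default node group is created; all nodes join via managed node groups,
// which use the cluster security group created by EKS.
const cluster = new eks.Cluster("cluster", {
  skipDefaultNodeGroup: true,
  instanceRoles: [nodeRole],
});

const mng = new eks.ManagedNodeGroup("my-mng", {
  cluster: cluster,
  nodeRole: nodeRole,
  instanceTypes: ["t3.medium"],
  scalingConfig: { desiredSize: 2, minSize: 1, maxSize: 3 },
});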

@rquitales added the awaiting-feedback label and removed the needs-repro label Jul 26, 2024
@rquitales self-assigned this Jul 26, 2024
@JustASquid (Contributor, Author)

Thank you @rquitales, and you are exactly right - I got the order back to front in my original post; indeed, the MNGs are using the EKS-created SG.

And I was able to work around the issue by skipping the default node group.

I do feel that this behavior is non-ideal, though, as the path of least resistance when setting up a cluster is to use the default node group. It's easy to run into the case where it cannot communicate with any subsequent MNGs, and the problem can manifest in a very non-obvious way (in my case, no DNS resolution on the MNG nodes).

Going down the road of specifying a custom launch template is not trivial, not least because you need to fetch an AMI ID. So I guess the question is: is there a particular reason it works this way? Why not just have the default NG be assigned the cluster's EKS-created security group?
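
For reference, one way to avoid hard-coding the AMI ID is the public SSM parameter AWS publishes for the EKS-optimized Amazon Linux 2 AMI per Kubernetes version. A minimal sketch, assuming the eks.Cluster from the earlier example is in scope as cluster:

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

// AWS publishes the recommended EKS-optimized Amazon Linux 2 AMI ID as a
// public SSM parameter, keyed by the cluster's Kubernetes version.
const eksAmiId = aws.ssm
  .getParameterOutput({
    name: pulumi.interpolate`/aws/service/eks/optimized-ami/${cluster.eksCluster.version}/amazon-linux-2/recommended/image_id`,
  })
  .apply((p) => p.value);

// eksAmiId could then be passed as imageId to the aws.ec2.LaunchTemplate shown earlier.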

@pulumi-bot added the needs-triage label and removed the awaiting-feedback label Jul 29, 2024
@flostadler (Contributor)

Another possible workaround is setting up the necessary security group rules to allow the different node groups to communicate. Like this for example:

import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

const cluster = new eks.Cluster("example-managed-nodegroups", {
  // ... (other configurations)
  skipDefaultNodeGroup: false,
});

// The security group that EKS creates and attaches to managed node group instances.
export const clusterSecurityGroup = cluster.eksCluster.vpcConfig.clusterSecurityGroupId;

// Allow traffic from the Pulumi-managed node security group into the EKS-managed cluster security group.
const eksClusterIngressRule = new aws.vpc.SecurityGroupIngressRule("selfManagedNodeIngressRule", {
  description: "Allow managed workloads to communicate with self managed nodes",
  fromPort: 0,
  toPort: 0,
  ipProtocol: "-1",
  securityGroupId: clusterSecurityGroup,
  referencedSecurityGroupId: cluster.nodeSecurityGroup.id,
});

// Allow traffic from the EKS-managed cluster security group into the Pulumi-managed node security group.
const nodeIngressRule = new aws.vpc.SecurityGroupIngressRule("managedNodeIngressRule", {
  description: "Allow self managed workloads to communicate with managed workloads",
  fromPort: 0,
  toPort: 0,
  ipProtocol: "-1",
  securityGroupId: cluster.nodeSecurityGroup.id,
  referencedSecurityGroupId: clusterSecurityGroup,
});

I'm going to check what it would take to add this to the component itself. I'm not particularly concerned about security implications here, because we already have open firewall rules within the Pulumi-managed security group:

const nodeIngressRule = new aws.ec2.SecurityGroupRule(
    `${name}-eksNodeIngressRule`,
    {
        description: "Allow nodes to communicate with each other",
        type: "ingress",
        fromPort: 0,
        toPort: 0,
        protocol: "-1", // all
        securityGroupId: nodeSecurityGroup.id,
        self: true,
    },
    { parent, provider },
);

Likewise, the EKS-managed security group allows all communication within itself: https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html

@flostadler added the impact/usability label and removed the needs-triage label Jul 30, 2024