aws ecs execute-command fails with TargetNotConnectedException #9070

Closed
1 task done
leejayhsu opened this issue Nov 12, 2024 · 12 comments
Labels: bug, customization, ecs-execute-command, p2, potential-regression

Comments

@leejayhsu

Describe the bug

I am unable to use ecs execute-command to connect to my ecs fargate task

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

I should be able to connect to my ecs fargate task

Current Behavior

It fails to connect to ecs fargate task

command

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Error

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.


An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

amazon-ecs-exec-checker output

Prerequisites for check-ecs-exec.sh v0.7
-------------------------------------------------------------
  jq      | OK (/opt/homebrew/bin/jq)
  AWS CLI | OK (/opt/homebrew/bin/aws)

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.19.4 Python/3.12.7 Darwin/24.0.0 source/arm64)
  Session Manager Plugin | OK (1.2.688.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : us-west-2
Cluster: core-services
Task   : d179d101efa94c98aa62340b5705d726
-------------------------------------------------------------
  Cluster Configuration  |
     KMS Key       : Not Configured
     Audit Logging : OVERRIDE
     S3 Bucket Name: Not Configured
     CW Log Group  : /ecs/dev/core-services, Encryption Enabled: true
  Can I ExecuteCommand?  | arn:aws:iam::xxxxx:user/xxxxx
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "log-router"
         2. RUNNING for "datadog-agent"
         3. RUNNING for "app"
    ----------
      Init Process Enabled (dev-app-task-def:555)
    ----------
         1. Enabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
    ----------
      Read-Only Root Filesystem (dev-app-task-def:555)
    ----------
         1. Disabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
  Task Role Permissions  | arn:aws:iam::xxxxx:role/ecsTaskExecutionRole
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
     -----
     logs:DescribeLogGroups: allowed
     logs:CreateLogStream: allowed
     logs:DescribeLogStreams: allowed
     logs:PutLogEvents: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-xxxxx:
      - com.amazonaws.us-west-2.s3
      - com.amazonaws.us-west-2.secretsmanager
      - com.amazonaws.us-west-2.ecr.api
      - com.amazonaws.us-west-2.ecr.dkr
      - com.amazonaws.us-west-2.ssmmessages
  Environment Variables  | (dev-app-task-def:555)
       1. container "app"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       2. container "datadog-agent"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       3. container "log-router"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
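
The "Managed Agent Status" rows above can also be read directly from a `describe-tasks` response. The following is a minimal sketch, assuming the response has already been loaded into a Python dict (for example from `aws ecs describe-tasks --output json`). The function name and the sample data are hypothetical and only mirror the three containers in this report. Note that in this thread the agent reported RUNNING for every container and the exec call still failed, so this check alone does not rule out the error.

```python
def exec_agent_status(describe_tasks_response):
    """Return {container name: ExecuteCommandAgent lastStatus, or None if absent}."""
    statuses = {}
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            status = None
            for agent in container.get("managedAgents", []):
                if agent.get("name") == "ExecuteCommandAgent":
                    status = agent.get("lastStatus")
            statuses[container.get("name")] = status
    return statuses


# Hypothetical, heavily trimmed response mirroring the containers in this issue.
sample_response = {
    "tasks": [
        {
            "containers": [
                {
                    "name": "log-router",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}
                    ],
                },
                {
                    "name": "datadog-agent",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}
                    ],
                },
                {
                    "name": "app",
                    "managedAgents": [
                        {"name": "ExecuteCommandAgent", "lastStatus": "RUNNING"}
                    ],
                },
            ]
        }
    ]
}

for name, status in exec_agent_status(sample_response).items():
    print(f"{name}: {status}")
```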

Reproduction Steps

run this command:

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.19.4

Environment details (OS name and version, etc.)

Python/3.12.7 Darwin/24.0.0 source/arm64

@leejayhsu leejayhsu added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 12, 2024
@github-actions github-actions bot added the potential-regression Marking this issue as a potential regression to be checked by team member label Nov 12, 2024
@tim-finnigan tim-finnigan self-assigned this Nov 12, 2024
@tim-finnigan tim-finnigan added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Nov 12, 2024
@tim-finnigan
Contributor

Thanks for reaching out. The TargetNotConnectedException has been reported in several past issues. Have you tried looking through those?

The troubleshooting post for this error says you might get it for the following reasons:

  • The Amazon ECS task role doesn't have the required permissions to run the execute-command command.
  • The AWS Identity and Access Management (IAM) role or user that's running the command doesn't have the required permissions.

Others have suggested that the issue could be fixed by changing your environment variables or updating your AMI.

Also could you explain why you marked this as potential-regression? Was this working for you in a previous version of the AWS CLI?

@tim-finnigan tim-finnigan added customization Issues related to CLI customizations (located in /awscli/customizations) ecs-execute-command p2 This is a standard priority issue response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. needs-triage This issue or PR still needs to be triaged. labels Nov 12, 2024
@leejayhsu
Author

Hi @tim-finnigan 👋

Yeah I have looked at most of those past issues, but I will look again to make sure I didn't miss any potential solutions.

For context, I'm using ecs fargate, platform version 1.4

Things I've tried to fix this:

  • verified that ecs task taskRoleArn and executionRoleArn both have the following permissions
{
    "Statement": [
        {
            "Action": [
                "ssmmessages:OpenDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:CreateControlChannel"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
  • verified that my aws role that is trying to exec has the permission ecs:ExecuteCommand
  • ran https://github.com/aws-containers/amazon-ecs-exec-checker, no errors
  • ecs task has outbound internet connectivity (but I also created a VPC endpoint for ssmmessages just in case: com.amazonaws.us-west-2.ssmmessages)
  • do NOT have AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY as env vars in my tasks
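
The permission check in the first bullet can be sketched in code. This is a rough illustration only, assuming the policy document is available as a Python dict. The function name `missing_exec_actions` is hypothetical, and the sketch looks only at `Allow` statements: it ignores `Deny` statements, `NotAction`, conditions, and resource scoping, so an empty result is not proof that the role can open the channels.

```python
from fnmatch import fnmatchcase

# The four actions listed in the policy above.
REQUIRED_ACTIONS = {
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}


def missing_exec_actions(policy_document):
    """Return the required actions that no Allow statement appears to cover."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lower case.
        allowed_patterns.extend(action.lower() for action in actions)

    return [
        required
        for required in sorted(REQUIRED_ACTIONS)
        if not any(fnmatchcase(required.lower(), p) for p in allowed_patterns)
    ]


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:OpenDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:CreateControlChannel",
            ],
            "Resource": "*",
        }
    ],
}

print(missing_exec_actions(policy))
```

Wildcard entries such as `ssmmessages:*` are matched with `fnmatch`, which is an approximation of IAM wildcard matching rather than a full policy evaluation.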

ecs exec used to work for me, so I thought it would be ok to mark this as a regression. But this is only conjecture on my part, so please remove the tag if you feel it is appropriate!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 12, 2024
@rnathuji

Just to chime in on a potential regression: We are also experiencing this issue with Fargate where things were working fine, and then seemingly stopped working suddenly for no apparent reason. amazon-ecs-exec-checker is clear.

@tim-finnigan
Contributor

Thanks for following up - we may need to loop in ECS/Fargate here as well. Did this issue start occurring after updating to a specific version? Could you share your debug logs (with any sensitive info redacted) to help with further investigation?

@tim-finnigan tim-finnigan added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 14, 2024
@rnathuji

@tim-finnigan - I...spoke too soon when chiming in above 😅 . I believe the issue was a bug in our infrastructure as code which caused some non-determinism related to the subnet associated with tasks. A container cycle caused some to land in an isolated subnet inadvertently, and that was the root issue for the "suddenly for no apparent reason". Fixing the IaC issue solved our problem.
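
One way to spot a task that has landed in an isolated subnet is to inspect the route table associated with that subnet. The sketch below assumes that "isolated" means the route table has no usable default route; the comment above does not say how the subnet was isolated or how it was diagnosed. It expects a single route table entry in the shape returned by `aws ec2 describe-route-tables`, and the function name and sample IDs are hypothetical. A subnet without a default route can still reach Session Manager through a `com.amazonaws.<region>.ssmmessages` VPC endpoint, so a `False` result is a hint, not a verdict.

```python
def has_default_route(route_table):
    """Return True if the route table has a non-blackhole 0.0.0.0/0 route."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue  # local and more specific routes do not give general egress
        if route.get("State") == "blackhole":
            continue  # the route's target no longer exists
        return True
    return False


# Hypothetical route tables: one with only the local route, one with a NAT gateway.
isolated_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"}
    ]
}
private_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {
            "DestinationCidrBlock": "0.0.0.0/0",
            "NatGatewayId": "nat-0123456789abcdef0",
            "State": "active",
        },
    ]
}

print(has_default_route(isolated_table))
print(has_default_route(private_table))
```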

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 14, 2024
@tim-finnigan
Contributor

tim-finnigan commented Nov 14, 2024

> @tim-finnigan - I...spoke too soon when chiming in above 😅 . I believe the issue was a bug in our infrastructure as code which caused some non-determinism related to the subnet associated with tasks. A container cycle caused some to land in an isolated subnet inadvertently, and that was the root issue for the "suddenly for no apparent reason". Fixing the IaC issue solved our problem.

No worries, thanks for following up and glad that issue is resolved. For the original issue author — I'll mention this troubleshooting guide again for reference: https://repost.aws/knowledge-center/fargate-ecs-exec-errors. If you're still seeing the issue, please share your debug logs for further investigation.

@tim-finnigan tim-finnigan added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 14, 2024
@leejayhsu
Author

leejayhsu commented Nov 15, 2024

hi @tim-finnigan

I've narrowed the problem down to a sidecar container, [aws-fluent-bit](https://github.com/aws/aws-for-fluent-bit), which I was using to stream logs to datadog. I'm not exactly sure why it's a problem, but I can exec into the fargate task once I remove the `aws-fluent-bit` container from the task definition.

Do you happen to know if there are any known issues that would cause fluent bit to interfere with ecs exec? This is the relevant part of the task def

{
    "name": "log-router",
    "image": "amazon/aws-for-fluent-bit:stable",
    "cpu": 0,
    "portMappings": [],
    "essential": false,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "user": "0",
    "dockerLabels": {
        "com.datadoghq.tags.service": "log-router",
        "com.datadoghq.tags.env": "dev"
    },
    "systemControls": [],
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "config-file-type": "file",
            "config-file-value": "/fluent-bit/configs/parse-json.conf",
            "enable-ecs-log-metadata": "true"
        }
    }
}
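
For anyone comparing their own task definitions against this one, the sketch below lists which containers act as FireLens log routers and which ones send their logs through FireLens. It assumes the task definition is available as a Python dict in the shape returned by `aws ecs describe-task-definition` (under its `taskDefinition` key). The function name is hypothetical, and the `app` container's `logConfiguration` in the sample is an assumption for illustration, since the thread does not show it. The thread does not establish why the log router interferes with ECS Exec, so this only helps locate the containers involved.

```python
def firelens_usage(task_definition):
    """Return container names that are FireLens routers or log through FireLens."""
    routers, senders = [], []
    for container in task_definition.get("containerDefinitions", []):
        if "firelensConfiguration" in container:
            routers.append(container["name"])
        log_config = container.get("logConfiguration") or {}
        if log_config.get("logDriver") == "awsfirelens":
            senders.append(container["name"])
    return {"routers": routers, "senders": senders}


# Trimmed sample: "log-router" follows the snippet above; the "app" entry is hypothetical.
sample_task_definition = {
    "containerDefinitions": [
        {
            "name": "log-router",
            "image": "amazon/aws-for-fluent-bit:stable",
            "firelensConfiguration": {"type": "fluentbit"},
        },
        {
            "name": "app",
            "logConfiguration": {"logDriver": "awsfirelens"},
        },
    ]
}

print(firelens_usage(sample_task_definition))
```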

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 15, 2024
@leejayhsu
Author

confirmed that removing aws-fluent-bit container from the task definition fixed the issue. now ecs exec is working properly.


This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@lkashef

lkashef commented Nov 18, 2024

Hey @leejayhsu, we are facing the same problem. I assume removing the log-router can't be a permanent solution, so I'm curious: what did you end up doing?

@leejayhsu
Author

hi @lkashef 👋
Actually removing log-router was my permanent solution 😄
It only existed in the task definition because the logging aggregator I used recommended streaming logs to it. I'm now just logging to cloudwatch, and no longer using fluent-bit for logging.

sorry this probably isn't the answer you were hoping for!

@leejayhsu
Author

@lkashef I also had another task which I couldn't exec into, and disabling logging in the datadog-agent container fixed it (this was quite unexpected).
