aws ecs execute-command fails with TargetNotConnectedException #9070

leejayhsu · 2024-11-12T00:16:42Z

Describe the bug

I am unable to use ecs execute-command to connect to my ecs fargate task

Regression Issue

Select this option if this issue appears to be a regression.

Expected Behavior

I should be able to connect to my ecs fargate task

Current Behavior

It fails to connect to ecs fargate task

command

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Error

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.


An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

amazon-ecs-exec-checker output

Prerequisites for check-ecs-exec.sh v0.7
-------------------------------------------------------------
  jq      | OK (/opt/homebrew/bin/jq)
  AWS CLI | OK (/opt/homebrew/bin/aws)

-------------------------------------------------------------
Prerequisites for the AWS CLI to use ECS Exec
-------------------------------------------------------------
  AWS CLI Version        | OK (aws-cli/2.19.4 Python/3.12.7 Darwin/24.0.0 source/arm64)
  Session Manager Plugin | OK (1.2.688.0)

-------------------------------------------------------------
Checks on ECS task and other resources
-------------------------------------------------------------
Region : us-west-2
Cluster: core-services
Task   : d179d101efa94c98aa62340b5705d726
-------------------------------------------------------------
  Cluster Configuration  |
     KMS Key       : Not Configured
     Audit Logging : OVERRIDE
     S3 Bucket Name: Not Configured
     CW Log Group  : /ecs/dev/core-services, Encryption Enabled: true
  Can I ExecuteCommand?  | arn:aws:iam::xxxxx:user/xxxxx
     ecs:ExecuteCommand: allowed
     ssm:StartSession denied?: allowed
  Task Status            | RUNNING
  Launch Type            | Fargate
  Platform Version       | 1.4.0
  Exec Enabled for Task  | OK
  Container-Level Checks |
    ----------
      Managed Agent Status
    ----------
         1. RUNNING for "log-router"
         2. RUNNING for "datadog-agent"
         3. RUNNING for "app"
    ----------
      Init Process Enabled (dev-app-task-def:555)
    ----------
         1. Enabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
    ----------
      Read-Only Root Filesystem (dev-app-task-def:555)
    ----------
         1. Disabled - "app"
         2. Disabled - "datadog-agent"
         3. Disabled - "log-router"
  Task Role Permissions  | arn:aws:iam::xxxxx:role/ecsTaskExecutionRole
     ssmmessages:CreateControlChannel: allowed
     ssmmessages:CreateDataChannel: allowed
     ssmmessages:OpenControlChannel: allowed
     ssmmessages:OpenDataChannel: allowed
     -----
     logs:DescribeLogGroups: allowed
     logs:CreateLogStream: allowed
     logs:DescribeLogStreams: allowed
     logs:PutLogEvents: allowed
  VPC Endpoints          |
    Found existing endpoints for vpc-xxxxx:
      - com.amazonaws.us-west-2.s3
      - com.amazonaws.us-west-2.secretsmanager
      - com.amazonaws.us-west-2.ecr.api
      - com.amazonaws.us-west-2.ecr.dkr
      - com.amazonaws.us-west-2.ssmmessages
  Environment Variables  | (dev-app-task-def:555)
       1. container "app"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       2. container "datadog-agent"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
       3. container "log-router"
       - AWS_ACCESS_KEY: not defined
       - AWS_ACCESS_KEY_ID: not defined
       - AWS_SECRET_ACCESS_KEY: not defined
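The "Exec Enabled for Task" and "Managed Agent Status" rows above can also be read directly from `aws ecs describe-tasks` output, where each task carries `enableExecuteCommand` and each container carries a `managedAgents` list. Here is a minimal sketch, assuming that JSON has been saved as text; the helper name `exec_agent_status` is hypothetical. As this report shows, a RUNNING agent is necessary but not sufficient for ECS Exec to connect.

```python
import json


def exec_agent_status(describe_tasks_json: str) -> list[dict]:
    """Summarize ECS Exec readiness per container.

    Hypothetical helper: expects the JSON text printed by
    `aws ecs describe-tasks --cluster ... --tasks ...`.
    """
    response = json.loads(describe_tasks_json)
    rows = []
    for task in response.get("tasks", []):
        # Task-level flag set when the task was launched with exec enabled.
        exec_enabled = bool(task.get("enableExecuteCommand", False))
        for container in task.get("containers", []):
            status = "MISSING"  # no ExecuteCommandAgent entry found
            for agent in container.get("managedAgents", []):
                if agent.get("name") == "ExecuteCommandAgent":
                    status = agent.get("lastStatus", "UNKNOWN")
            rows.append(
                {
                    "container": container.get("name", "<unnamed>"),
                    "exec_enabled": exec_enabled,
                    "agent_status": status,
                }
            )
    return rows
```

Any row with `exec_enabled` false or an `agent_status` other than RUNNING is worth looking at before digging into IAM or networking.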

Reproduction Steps

run this command:

aws ecs execute-command  \
--region us-west-2 \
--cluster core-services \
--task d179d101efa94c98aa62340b5705d726 \
--container app \
--command "/bin/bash" \
--interactive

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.19.4

Environment details (OS name and version, etc.)

Python/3.12.7 Darwin/24.0.0 source/arm64

tim-finnigan · 2024-11-12T17:29:36Z

Thanks for reaching out. The TargetNotConnectedException has been reported in several past issues. Have you tried looking through those?

The troubleshooting post for this error says you might get it for the following reasons:

The Amazon ECS task role doesn't have the required permissions to run the execute-command command.

The AWS Identity and Access Management (IAM) role or user that's running the command doesn't have the required permissions.

Others have suggested that the issue could be fixed by changing your environment variables or updating your AMI.

Also could you explain why you marked this as potential-regression? Was this working for you in a previous version of the AWS CLI?

leejayhsu · 2024-11-12T19:57:27Z

Hi @tim-finnigan 👋

Yeah I have looked at most of those past issues, but I will look again to make sure I didn't miss any potential solutions.

For context, I'm using ecs fargate, platform version 1.4

Things I've tried to fix this:

verified that ecs task taskRoleArn and executionRoleArn both have the following permissions

{
    "Statement": [
        {
            "Action": [
                "ssmmessages:OpenDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:CreateControlChannel"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}

verified that my aws role that is trying to exec has the permission ecs:ExecuteCommand
ran https://github.com/aws-containers/amazon-ecs-exec-checker, no errors
ecs task has outbound internet connectivity (but I also created a VPC endpoint for ssmmessages just in case: com.amazonaws.us-west-2.ssmmessages)
do NOT have AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY as env vars in my tasks
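
The permission check in the list above can be scripted. Here is a minimal sketch, assuming the role's policy document is available as strict JSON text (`json.loads` rejects trailing commas). The helper name `missing_ssm_actions` is hypothetical, and it only looks at plain Allow statements with literal action names or the `*` / `ssmmessages:*` wildcards. It does not evaluate Deny statements, conditions, or `NotAction`.

```python
import json

# The four actions listed in the policy above.
REQUIRED_ACTIONS = {
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}


def missing_ssm_actions(policy_json: str) -> set[str]:
    """Return the required actions that no Allow statement grants.

    Hypothetical helper; a simplified reading of the policy, not a full
    IAM evaluation.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    granted = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a single string
            actions = [actions]
        for action in actions:
            lowered = action.lower()  # IAM action names are case-insensitive
            if lowered in ("*", "ssmmessages:*"):
                return set()
            granted.add(lowered)
    return {a for a in REQUIRED_ACTIONS if a.lower() not in granted}
```

An empty set means all four actions are allowed by this simplified reading; anything else names the actions to add to the task role.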

ecs exec used to work for me, so I thought it would be ok to mark this as a regression. But this is only conjecture on my part, so please remove the tag if you feel it is appropriate!

rnathuji · 2024-11-14T16:21:35Z

Just to chime in on a potential regression: We are also experiencing this issue with Fargate where things were working fine, and then seemingly stopped working suddenly for no apparent reason. amazon-ecs-exec-checker is clear.

tim-finnigan · 2024-11-14T16:50:06Z

Thanks for following up - we may need to loop in ECS/Fargate here as well. Did this issue start occurring after updating to a specific version? Could you share your debug logs (with any sensitive info redacted) to help with further investigation?

rnathuji · 2024-11-14T17:37:39Z

@tim-finnigan - I...spoke too soon when chiming in above 😅 . I believe the issue was a bug in our infrastructure as code which caused some non-determinism related to the subnet associated with tasks. A container cycle caused some to land in an isolated subnet inadvertently, and that was the root issue for the "suddenly for no apparent reason". Fixing the IaC issue solved our problem.
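
One way to spot this kind of subnet drift is to compare the subnet each task landed in against the subnets you expect. Here is a rough sketch, assuming the JSON printed by `aws ecs describe-tasks` is available as text. For awsvpc tasks, the network interface attachment lists a `subnetId` detail. The function name `tasks_outside_subnets` and the IDs in the usage note are made up for illustration.

```python
import json


def tasks_outside_subnets(describe_tasks_json: str, expected_subnets: set[str]) -> dict[str, str]:
    """Map task ARN -> subnet ID for tasks attached to an unexpected subnet.

    Hypothetical helper: reads the `attachments[].details[]` name/value
    pairs from `aws ecs describe-tasks` output.
    """
    response = json.loads(describe_tasks_json)
    unexpected = {}
    for task in response.get("tasks", []):
        task_arn = task.get("taskArn", "<unknown>")
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail.get("name") != "subnetId":
                    continue
                subnet_id = detail.get("value")
                if subnet_id not in expected_subnets:
                    unexpected[task_arn] = subnet_id
    return unexpected
```

For example, `tasks_outside_subnets(output, {"subnet-aaa", "subnet-bbb"})` returns an empty dict when every task sits in one of the two placeholder subnets.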

tim-finnigan · 2024-11-14T22:02:44Z

No worries, thanks for following up and glad that issue is resolved. For the original issue author — I'll mention this troubleshooting guide again for reference: https://repost.aws/knowledge-center/fargate-ecs-exec-errors. If you're still seeing the issue, please share your debug logs for further investigation.

leejayhsu · 2024-11-15T18:35:16Z

hi @tim-finnigan

I've narrowed the problem down to a sidecar container, [aws-fluent-bit](https://github.com/aws/aws-for-fluent-bit), which I was using to stream logs to Datadog. I'm not exactly sure why it's a problem, but I can exec into the Fargate task once I remove the aws-fluent-bit container from the task definition.

Do you happen to know if there are any known issues that would cause fluent bit to interfere with ecs exec? This is the relevant part of the task def:

{
    "name": "log-router",
    "image": "amazon/aws-for-fluent-bit:stable",
    "cpu": 0,
    "portMappings": [],
    "essential": false,
    "environment": [],
    "mountPoints": [],
    "volumesFrom": [],
    "user": "0",
    "dockerLabels": {
        "com.datadoghq.tags.service": "log-router",
        "com.datadoghq.tags.env": "dev"
    },
    "systemControls": [],
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "config-file-type": "file",
            "config-file-value": "/fluent-bit/configs/parse-json.conf",
            "enable-ecs-log-metadata": "true"
        }
    }
}
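
When bisecting a task definition like this one, it can help to list which containers are FireLens log routers and which containers send their logs through FireLens. Here is a minimal sketch, assuming the JSON is either the output of `aws ecs describe-task-definition` (a top-level `taskDefinition` object) or the bare task definition. `firelens_summary` is a hypothetical helper. It only reports configuration and does not explain why ECS Exec fails.

```python
import json


def firelens_summary(task_def_json: str) -> dict[str, list[str]]:
    """List FireLens router containers and containers that log through FireLens.

    Hypothetical helper for narrowing down which containers to remove or
    reconfigure while testing ECS Exec.
    """
    doc = json.loads(task_def_json)
    # Accept both the describe-task-definition wrapper and the bare object.
    task_def = doc.get("taskDefinition", doc)

    routers, users = [], []
    for container in task_def.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        if "firelensConfiguration" in container:
            routers.append(name)
        log_config = container.get("logConfiguration") or {}
        if log_config.get("logDriver") == "awsfirelens":
            users.append(name)
    return {"routers": routers, "uses_firelens": users}
```

Run against the fragment above wrapped in a `containerDefinitions` list, it would report `log-router` under `routers`.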

leejayhsu · 2024-11-15T21:27:42Z

confirmed that removing aws-fluent-bit container from the task definition fixed the issue. now ecs exec is working properly.

github-actions · 2024-11-15T21:27:59Z

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

lkashef · 2024-11-18T20:27:54Z

Hey @leejayhsu, we are facing the same problem. I assume removing the log-router can't be a permanent solution, so I'm curious what you ended up doing.

leejayhsu · 2024-11-19T04:46:49Z

hi @lkashef 👋
Actually removing log-router was my permanent solution 😄
It only existed in the task definition because the logging aggregator I used recommended streaming logs to it. I'm now just logging to cloudwatch, and no longer using fluent-bit for logging.

sorry this probably isn't the answer you were hoping for!

leejayhsu · 2024-11-19T19:14:41Z

@lkashef I also had another task which I couldn't exec into, and disabling logging in the datadog-agent container fixed it (this was quite unexpected).

leejayhsu added the bug and needs-triage labels Nov 12, 2024

github-actions bot added the potential-regression label Nov 12, 2024

tim-finnigan self-assigned this Nov 12, 2024

tim-finnigan added the investigating label Nov 12, 2024

github-actions bot removed the response-requested label Nov 12, 2024

tim-finnigan added the response-requested label Nov 14, 2024

github-actions bot removed the response-requested label Nov 14, 2024

tim-finnigan added the response-requested label Nov 14, 2024

github-actions bot removed the response-requested label Nov 15, 2024

leejayhsu closed this as completed Nov 15, 2024
