Getting too many failures in processingJob #32

Open
grossamit opened this issue Oct 18, 2021 · 18 comments
@grossamit

ClientError: Failed to download data. ListObjectsV2 failed for s3://.... nextToken:[null]: Unable to execute request to S3

The thing is that it sometimes succeeds and sometimes doesn't.
I've also added code to wait 10 seconds after the notebook upload and to verify with ListObjectsV2 that the file exists after the upload.
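The wait-and-verify step described above can be sketched as a small retry helper. This is a minimal sketch, not code from the issue; the boto3 wiring in the trailing comment is an assumption, and the bucket and key names there are placeholders.

```python
import time


def wait_for_object(check_exists, attempts=5, delay=2.0):
    """Poll until check_exists() returns True, or give up after `attempts` tries.

    check_exists: a zero-argument callable, e.g. one wrapping
    s3.head_object for the uploaded notebook key.
    """
    for _ in range(attempts):
        if check_exists():
            return True
        time.sleep(delay)
    return False


# With boto3 (not run here; bucket/key are hypothetical), the callable
# might look like:
#
#   import boto3, botocore
#   s3 = boto3.client("s3")
#   def check_exists():
#       try:
#           s3.head_object(Bucket="my-bucket",
#                          Key="papermill_input/notebook.ipynb")
#           return True
#       except botocore.exceptions.ClientError:
#           return False
```

Note that a poll like this only proves the object is visible from where the poll runs; it says nothing about visibility from inside the processing job's VPC, which is where the error in this thread occurs.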

@tomfaulhaber
Contributor

@grossamit What are you running when you see this error? Where is the error coming from? Do you see this error all the time or just intermittently?

@grossamit
Author

grossamit commented Dec 31, 2021

@tomfaulhaber it happens intermittently.
Please note that I'm specifying security groups and subnet lists during my run using VpcConfig and NetworkConfig.
I get these errors a lot. The weird part is that if you wait approx. 40 minutes it recovers, until it happens again.
Full error:
Failed (ClientError: Failed to download data. ListObjectsV2 failed for s3://aws-emr-resources-406095609952-us-east-1/dataAccess/[email protected]/papermill_input/searchVariables_MP_extractMaping_prod.ipynb-2021-12-31-09-26-33.ipynb, nextToken:[null]: Unable to execute request to S3)

I'm also running the notebook with parameters and a specified instance type.

@tomfaulhaber
Contributor

@grossamit My guess is that this is an issue with the way you're routing connections from your SageMaker Processing node to your VPC. One thing to do would be to check that your subnet definitions are right, that your security groups don't rely on fixed IPs, and whether there's anything else that could cause trouble depending on which IP address the SageMaker Processing instance is given.

@grossamit
Author

@tomfaulhaber thanks for your reply!
I believe that if that were the case, it would fail consistently rather than intermittently. Currently I have ~30% success :-(
I do not have fixed IPs. I'll play a little with the subnets.

@tomfaulhaber
Contributor

@grossamit I would expect exactly this behavior if, for example, you had a VPC with multiple subnets but only enabled the S3 endpoint for a single subnet.
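One way to check for that misconfiguration is to compare the subnets in your VpcConfig against the route tables the S3 gateway endpoint is associated with. The following is a sketch over data in the shape EC2's DescribeRouteTables returns; the IDs are placeholders, and the sample data at the bottom stands in for real boto3 output.

```python
def subnets_missing_s3_endpoint(subnet_ids, route_tables, endpoint_route_table_ids):
    """Return the subnets whose route table is not associated with the S3 gateway endpoint.

    Caveat: subnets with no explicit route-table association fall back to
    the VPC's main route table, which this simple check treats as missing.
    """
    # Map each explicitly associated subnet to its route table.
    subnet_to_rtb = {}
    for rtb in route_tables:
        for assoc in rtb.get("Associations", []):
            if "SubnetId" in assoc:
                subnet_to_rtb[assoc["SubnetId"]] = rtb["RouteTableId"]
    covered = set(endpoint_route_table_ids)
    return [s for s in subnet_ids if subnet_to_rtb.get(s) not in covered]


# Placeholder data; with boto3 you would feed in
# ec2.describe_route_tables()["RouteTables"] and the endpoint's
# RouteTableIds from ec2.describe_vpc_endpoints().
route_tables = [
    {"RouteTableId": "rtb-aaa", "Associations": [{"SubnetId": "subnet-1"}]},
    {"RouteTableId": "rtb-bbb", "Associations": [{"SubnetId": "subnet-2"}]},
]
print(subnets_missing_s3_endpoint(
    ["subnet-1", "subnet-2"], route_tables, ["rtb-aaa"]))  # ['subnet-2']
```

A non-empty result would line up with the intermittent behavior: jobs placed in the covered subnet reach S3, jobs placed in the others don't.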

@tomfaulhaber
Contributor

Playing with this today, we realized there's an interaction between processing jobs and VPCs that's working differently than I understood. I think we can come up with a workaround.

@alena-m

alena-m commented Jan 13, 2022

Hi @tomfaulhaber
Could you share your findings? We're facing the same issue with a Processing job. We run the job in private subnets with this NetworkConfig:

    "NetworkConfig": {
        "EnableInterContainerTrafficEncryption": false,
        "EnableNetworkIsolation": false/true,  # (we tried both)
        "VpcConfig": {
            "SecurityGroupIds": [
                "sg-xxx"
            ],
            "Subnets": [
                "subnet-xxx",
                "subnet-xxx",
                "subnet-xxx"
            ]
        }
    }

But it can't access bucket with input data:

sagemaker.exceptions.UnexpectedStatusException: Error for Processing job my-processing-job: Failed. 
Reason: ClientError: Failed to download data. 
ListObjectsV2 failed for s3://my-bucket/input-data/, nextToken:[null]: 
Unable to execute request to S3
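For reference, a NetworkConfig block like the one above can be assembled with a small helper before being passed to boto3's create_processing_job. This is a sketch; the security group and subnet IDs are placeholders and the actual API call is not shown.

```python
def build_network_config(security_group_ids, subnet_ids,
                         isolate=False, encrypt_traffic=False):
    """Assemble the NetworkConfig block for SageMaker's CreateProcessingJob."""
    return {
        "EnableInterContainerTrafficEncryption": encrypt_traffic,
        "EnableNetworkIsolation": isolate,
        "VpcConfig": {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnet_ids),
        },
    }


cfg = build_network_config(["sg-xxx"], ["subnet-xxx", "subnet-xxx"])
# Passed as NetworkConfig=cfg to
# boto3.client("sagemaker").create_processing_job(...) (call not shown).
```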

@grossamit
Author

Hi @tomfaulhaber ,
Any progress with this? It really makes the solution unreliable.
Is there anything I can help with?

@papierGaylard

Any updates on this? I'm hitting the exact same issue.

@papierGaylard

In my case, I need to run my SageMaker processing job within a VPC and subnet, and I'm specifying them like so:

--extra '{ "NetworkConfig": { "EnableInterContainerTrafficEncryption": false, "EnableNetworkIsolation": false, "VpcConfig": { "SecurityGroupIds": [ "sg-xxxxxx" ], "Subnets": [ "subnet-xxxxxx" ] } } }'
However, I get an S3 ListObjects failure as soon as I use it. I also need to operate within a VPC/subnet with a specific IP range so I can connect to another service.

@papierGaylard

Hi @tomfaulhaber , Any progress with this? It really makes the solution unreliable. Anything I can help?

So I think you need to create a VPC endpoint. For some reason, processing jobs don't have access to AWS services despite being inside your VPC/subnet and having an ARN and role. You need to create a VPC endpoint, which is kind of like a pipe that gives SageMaker processing jobs direct access to specific AWS services.

Would probably be a good thing to add to the script, hah.
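Creating that S3 gateway endpoint comes down to a single EC2 CreateVpcEndpoint call. A sketch of the parameters follows; the VPC ID, route table ID, and region are placeholders, and the boto3 call itself is shown only in the comment.

```python
def s3_gateway_endpoint_params(vpc_id, route_table_ids, region):
    """Parameters for ec2.create_vpc_endpoint; S3 uses a Gateway-type endpoint,
    which is attached to route tables rather than to subnets directly."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


params = s3_gateway_endpoint_params("vpc-xxx", ["rtb-xxx"], "us-east-1")
# With boto3 (not run here):
#   boto3.client("ec2").create_vpc_endpoint(**params)
```

Because the endpoint is bound to route tables, every subnet used in the job's VpcConfig needs its route table listed here, which matches the per-subnet intermittent failures described earlier in the thread.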

@mattiasliljenzin

Also experiencing this. Any updates?

@schematical

schematical commented Apr 5, 2023

I ended up switching back to no VPC after a few tries and realized that my IAM roles were slightly off. I only had the object-level ARN (the bucket ARN with /* after it), when I also needed the bare bucket ARN with nothing after it, which s3:ListBucket requires. Like so:

        {
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::bucket",  <----- WAS MISSING THIS
                "arn:aws:s3:::bucket/*"
            ]
        },

I'll update this if I get it working with the VPC.

@ConstantSun

ClientError: Failed to download data. ListObjectsV2 failed for s3://.... nextToken:[null]: Unable to execute request to S3

The thing is that sometimes it succeed and sometimes not. I've also added a code to wait 10sec after the notebook upload and verify that the file exists after the upload with ListObjectsV2.

I got the same error, but then I removed all the network config from my processing job, and it works!

@gabriel-loka

Any update here? I created an S3 VPC endpoint but it's still giving me that error. I'm running training jobs in an isolated subnet.

@takeru1205

@gabriel-loka
I had the same problem. I solved it by allowing port 443 in the security group used by the connection.
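An outbound HTTPS rule like the one described can be expressed as an IpPermissions entry for EC2's AuthorizeSecurityGroupEgress. This is a sketch; the security group ID and the wide-open CIDR are placeholder assumptions (you may want to narrow the CIDR to your S3 prefix list or VPC ranges).

```python
def https_egress_permission(cidr="0.0.0.0/0"):
    """An IpPermissions entry allowing outbound TCP 443 (HTTPS)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": cidr}],
    }


perm = https_egress_permission()
# With boto3 (not run here):
#   boto3.client("ec2").authorize_security_group_egress(
#       GroupId="sg-xxx", IpPermissions=[perm])
```

S3 API calls travel over HTTPS, so a security group that blocks outbound 443 will produce exactly the "Unable to execute request to S3" failure even when the VPC endpoint itself is configured correctly.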

@telmen87

I'm facing this problem too. Any update?

@michelpf

michelpf commented Jan 6, 2024

I had the same error.
In my case I preferred not to have a NAT gateway, so I used the public access option when I configured the domain in SageMaker.
Following the suggestion to create a VPC endpoint for S3 solved this problem for me.

Thanks @papierGaylard ;)
