Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS IMDS Token is not being reused until it expires #1990

Open
harshkiprofile opened this issue Nov 4, 2024 · 14 comments
Open

AWS IMDS Token is not being reused until it expires #1990

harshkiprofile opened this issue Nov 4, 2024 · 14 comments

Comments

@harshkiprofile
Copy link
Contributor

The script currently generates an IMDS token with a 6-hour (21,600 seconds) validity, but it lacks logic to reuse the token until it expires. Instead, it fetches a new token on every call, which could be optimized.

@harshkiprofile
Copy link
Contributor Author

image

oalbrigt pushed a commit that referenced this issue Nov 6, 2024
* Introduce a new shell function to reuse IMDS token
* Utilize the get_token function to reuse the token
* Move token renewal function to aws.sh for reuse in AWS agent scripts
@thimslugga
Copy link
Contributor

thimslugga commented Nov 6, 2024

This new change introduces security concerns as it's writing the session token to a file under /var/run and care must be taken to keep this safe and ensure permissions set correctly for read/write i.e. umask, etc.

Ideally the session token ttl time should be much lower i.e. ~10 min or less and not using the max ttl of 6 hours.

I have a fork of the aws-vpc-move-ip, which introduces some changes and attempts to get the instance ID from local file and falls back to the IMDS if that file is not accessible e.g. non-nitro based EC2 instance, etc.

main...thimslugga:resource-agents:aws-dev

@oalbrigt
Copy link
Contributor

oalbrigt commented Nov 7, 2024

This new change introduces security concerns as it's writing the session token to a file under /var/run and care must be taken to keep this safe and ensure permissions set correctly for read/write i.e. umask, etc.

We can just chmod 600 the file after saving the token.

@thimslugga
Copy link
Contributor

thimslugga commented Nov 7, 2024

We can also retrieve the instance ID from the /sys/devices/virtual/dmi/id/board_asset_tag file on Nitro-based EC2 instances. This would prevent always having to make curl requests to retrieve this info and we can fallback to IMDS.

https://github.com/thimslugga/resource-agents/blob/cadd53c8a5c019f1679bda47d3d5b70471af4d3b/heartbeat/aws-vpc-move-ip#L286-L293

The sleep and wait time should also be bumped from the current 3 and 1 to something a bit higher in event the IMDS is temporarily unavailable e.g. ~10s.

@oalbrigt
Copy link
Contributor

oalbrigt commented Nov 8, 2024

Nice. I'll make a patch with the suggestions next week.

@oalbrigt
Copy link
Contributor

I've made a patch based on your suggestions: #1995

@gguifelixamz
Copy link
Contributor

What's the actual difference between reading it from a file and making a request to the IMDS metadata? Before merging this I would like to hear from the requester @harshkiprofile what's the use-case and what's the level of optimization achieved from this change.

Were any issues observed by fetching a token every time?

Writing a plain text token (in fact, any kind of secret) to a file is generally a bad idea, even if you set exclusive permissions.

@gguifelixamz
Copy link
Contributor

@oalbrigt can we revert PR1991until we have the requestor to clear up a few questions about the use-case? I don't think we should upstream this feature without some background context with a change of security and usability issues.

@harshkiprofile
Copy link
Contributor Author

The decision to reuse the token by storing it locally is based on the fact that during AWS EC2 hypervisor maintenance events, the IMDS metadata API call to fetch token can sometimes become temporarily unavailable, even if the EC2 instance itself remains operational. If a new token were fetched with every request during such events, metadata token retrieval could fail. By storing the metadata token locally (temporarily), we ensure uninterrupted access to metadata, regardless of IMDS availability during maintenance windows, thus enhancing system reliability.

This approach prioritizes fault tolerance over pure performance optimization. In the event of AWS maintenance, local token storage allows continued metadata access, reducing downtime and mitigating the impact of transient IMDS issues.

While storing tokens or secrets in files does introduce potential risks, @oalbrigt has implemented restricted file permissions to minimize these concerns.

Performance-wise, no significant issues have been observed with fetching a token for each request. However, in environments where transient IMDS issues occur frequently, the local token storage method provides increased robustness, eliminating the need for constant retries to acquire a token.

That was the rationale behind this change. I would love to hear your feedback on this one! I will explore on what can we done to strengthen the security as I believe caching the token improves stability.

@harshkiprofile
Copy link
Contributor Author

Another key rationale behind this change is that it helps prevent failures caused by AWS API call throttling.

image
image

@oalbrigt
Copy link
Contributor

@gguifelixamz The requests adds up when you have multiple resources. Say you have one or more fence agents (e.g. for different zones), then one or more of awseip, awsvip, aws-vpc-move-ip, and aws-vpc-route53.

We can revert it if we find a reason to undo it. If not we can do improvements based on your suggestions or pull requests.

Next release should be in january/february, so we got time to fix any issues before then.

@thimslugga
Copy link
Contributor

By storing the metadata token locally (temporarily), we ensure uninterrupted access to metadata, regardless of IMDS availability during maintenance windows, thus enhancing system reliability.

The session token is used to make subsequent calls to the IMDS service. Having it stored locally would still result in calls to the IMDS failing if the service was unavailable...

Another key rationale behind this change is that it helps prevent failures caused by AWS API call throttling.

The IMDS != AWS EC2 API. The other functions in the script that make use of the aws-cli, will still result in the aws-cli requesting a session token from IMDS, retrieving credentials from the IMDS and then making the actual call to the AWS EC2 API.

The requests adds up when you have multiple resources. Say you have one or more fence agents (e.g. for different zones), then one or more of awseip, awsvip, aws-vpc-move-ip, and aws-vpc-route53.

See above, do not confuse the IMDS and AWS EC2 API as they are not the same thing.

@harshkiprofile
Copy link
Contributor Author

The IMDS != AWS EC2 API. The other functions in the script that make use of the aws-cli, will still result in the aws-cli requesting a session token from IMDS, retrieving credentials from the IMDS and then making the actual call to the AWS EC2 API.

PPS limit != EC2 API
Actually, this isn't related to AWS EC2 API throttling; by "via request throttling," I mean that certain limits (PPS), when reached, cause requests to be throttled. If you look at the highlighted text in the screenshot, you'll see it refers to the packets-per-second (PPS) limits being exceeded. This solution significantly helps reduce the consumed PPS limit as it reduce the recurring calls to IMDS to get the token.

image

As a user of resource agents, I encountered several challenges in my environment that led me to propose this suggestion. Implementing the recommendation has significantly enhanced the stability of the environment.

@oalbrigt
Copy link
Contributor

@gguifelixamz I've added a commit to be able to refresh the token e.g. if the file is corrupt for some reason:
b8d3ecc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants