Support multiple regions/availability zones if an instance type cannot be started #60
Comments
I just came to file the same request. 95% of the time I get this:
and this tool fails if I remove Needing this for a CI, so this is a big issue. Thank you! |
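To support multi-AZ, we could make config.input.subnetId a list and try to iterate randomly over it here <https://github.com/machulav/ec2-github-runner/blob/8911518f6d7d3cbfc7de6cf9ecac0931f8025798/src/aws.js#L35> and stop when an instance has started. |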
Sounds like a very good idea
Thanks!
…On Fri, Aug 27, 2021, 09:55 Philipp Schmid ***@***.***> wrote:
We could make config.input.subnetId a list and try to iterate randomly
over it here
<https://github.com/machulav/ec2-github-runner/blob/8911518f6d7d3cbfc7de6cf9ecac0931f8025798/src/aws.js#L35>
and stop when an instance started.
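The idea quoted above (shuffle a list of subnet IDs and stop at the first successful launch) can be sketched roughly as follows. This is only an illustration in Python, not the action's actual JavaScript code: `launch`, `InsufficientCapacity`, and `fake_launch` are hypothetical stand-ins for the real EC2 call and its capacity error.

```python
import random
from typing import Callable, Optional, Sequence


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for EC2's InsufficientInstanceCapacity error."""


def start_in_first_available_subnet(
    subnet_ids: Sequence[str],
    launch: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Try the subnets in random order and return the first instance ID launched."""
    rng = rng or random.Random()
    candidates = list(subnet_ids)
    rng.shuffle(candidates)
    last_error: Optional[Exception] = None
    for subnet_id in candidates:
        try:
            # `launch` is assumed to wrap the real "run instance" call.
            return launch(subnet_id)
        except InsufficientCapacity as err:
            # No capacity in this subnet's AZ: remember the error and move on.
            last_error = err
    raise RuntimeError("no subnet had capacity for the instance type") from last_error


def fake_launch(subnet_id: str) -> str:
    """Demo launcher: pretend that only subnet-c has capacity."""
    if subnet_id != "subnet-c":
        raise InsufficientCapacity(subnet_id)
    return f"i-started-in-{subnet_id}"
```

Only the capacity error is swallowed here; any other failure (bad credentials, invalid AMI) still surfaces immediately, which seems like the safer default for a CI job.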
|
FWIW, when using EC2 launch templates, one can apparently omit the explicit selection of AZ/subnet. It might be worth it to investigate/test if this facility would imply intelligent automatic selection of AZ (taking capacity into account). |
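Turns out one can omit the subnet ID from the run parameters only if you are using the default VPC. Per the related AWS docs:
[EC2-VPC] If you don't specify a subnet ID, we choose a default subnet from your default VPC for you. If you don't have a default VPC, you must specify a subnet ID in the request.
See AWS docs <https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RunInstances.html>
|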
If I do use the default VPC, will it try only one AZ, or will it keep
trying all AZs?
Anyhow, I would like this to work across regions as well.
I did solve it with a very ugly workaround, but it seems to work. In the
start job, I added a step that, if the start fails, triggers the workflow
again with a different subnet ID. The subnet ID of the start step comes
from an env variable that is set from an input parameter.
…On Mon, Aug 30, 2021, 21:39 Jukka Palomäki ***@***.***> wrote:
Turns out one can only omit subnet ID from run parameters, if you are
using the default VPC.
[EC2-VPC] If you don't specify a subnet ID, we choose a default subnet
from your default VPC for you. If you don't have a default VPC, you must
specify a subnet ID in the request.
See AWS docs
<https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RunInstances.html>
|
I don't know, but I suspect it would just pick one. What I'd like to know is whether the automatic subnet selection algorithm is AZ capacity-aware or not. |
I doubt that, and I'll explain why.
Capacity of what? vCPUs? Memory? GPUs? You can have enough vCPUs in a
certain AZ but no GPUs.
I think it just picks a subnet randomly and hopes for the best. For us it
is even worse, because we mostly use GPU instances and not all AZs in a
given region have the type we are looking for.
…On Mon, Aug 30, 2021, 22:18 Jukka Palomäki ***@***.***> wrote:
If I do use the default vpc, will it try in one AZ or will keep on trying
in all AZs?
I don't know, but I suspect it would just pick one.
What I'd like to know is if the automatic selection algorithm is AZ
capacity-aware.
|
All needed capacity. AWS must be checking them anyhow, at instance launch time, to be able to distribute/optimize load. Looking at https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/ and then https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-capacity, the latter link does mention "not specifying AZ" as one solution. 🤷♂️ That said, one option could be to EDIT: amended workaround proposal because it looks like composite actions do not support |
That could be a solution, but I think it would be best if the code could go
over a list of regions and AZs and try each one of them in turn.
I did something similar in a Jenkins pipeline.
…On Mon, Aug 30, 2021, 23:31 Jukka Palomäki ***@***.***> wrote:
All needed capacity. AWS must be checking them anyhow, at instance launch
time, to be able to distribute/optimize load.
Looking at
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/
and then
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-capacity,
the latter link does mention "not specifying AZ" as one solution. 🤷♂️
That said, one option could be to abstract the runner start job in a composite
action
<https://docs.github.com/en/actions/creating-actions/creating-a-composite-action>
that tries ec2-github-runner start with different region/AZ settings (in
consecutive, conditional steps) until it succeeds, returning the start mode
outputs from the successful step?
|
That can probably be fixed by amending the |
@stas00 would something like this work? (NOTE: untested and some details omitted for brevity):
Also, if we supported launch templates (see #65), then multi-region would be less awkward as we could have a template per region (also less inputs per EDIT: fixed the |
Thank you for writing out a concrete example, @jpalomaki! I will try it and report back. So it's the same sub-region: we are using the same AMI/security group, but only rotating through the subnet IDs. Got it. |
Np. NOTE: I now fixed the |
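So I have done quite a few experiments and your example works perfectly well, @jpalomaki - thank you so much. I switched to us-east-1 since us-east-2 didn't even have 3 subnets that supported the instance type I needed. On us-east-1 I think there are only 3. It'd be nice to be able to remove the repetition of aws-resource-tags and other sections if at all possible. If not, that's OK too. The final config is here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/.github/workflows/main.yml#L17 |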
So is it going to support only different AZs but not different regions?
Thanks
…On Wed, Sep 1, 2021, 07:41 Stas Bekman ***@***.***> wrote:
So I have done quite a few experiments and your example works perfectly
well, @jpalomaki <https://github.com/jpalomaki> - thank you so much.
I switched to us-east-1 since us-east-2 didn't even have 3 subnets that
supported the instance type I needed. On us-east-1 I think there are only 3.
It'd be nice to be able to remove the repetition of aws-resource-tags and
other sections if at all possible. If not, that's OK too.
The final config is here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/.github/workflows/main.yml#L17
|
this particular setup is just sub-regions. I'm concerned this might not be enough still at times. So different regions would be even more fault tolerant, but this is already a great start. |
I agree that this is a good start but with the types we're using,
unfortunately, it won't be enough :-(
…On Wed, Sep 1, 2021, 08:24 Stas Bekman ***@***.***> wrote:
this particular setup is just sub-regions.
I'm concerned this might not be enough still at times. So different
regions would be even more fault tolerant, but this is already a great
start.
|
@stas00, great that you got it to work! I hope that the launch templates from #65 will make selecting the AZ within a region less painful in the future. They may also help with region selection: with one launch template in each region, you could try to create instances in each region one by one using the example from @jpalomaki. For now, I'm closing the issue as we have a solution. Let's monitor the situation and re-open it if required. |
…InsufficientInstanceCapacity error Ref machulav#60
Hey, I've implemented the idea of @philschmid
Check this |
We are using instance types that require lots of resources. Because of that, from time to time it is impossible to start an instance of a given type in a certain region and AZ.
It would be great if this plugin could run over a list of regions/AZs (subnets and security groups) and start the instance wherever its type is available.
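A rough sketch of what such a fallback over regions and AZs might look like is shown below. Since AMI IDs and security groups are specific to a region, each candidate carries its own full set of settings. Everything here is hypothetical and for illustration only (the `LaunchTarget` fields, `CapacityError`, and the `launch` callable are not part of the plugin), and it is written in Python rather than the plugin's JavaScript.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LaunchTarget:
    """One place to try: all IDs are assumed to be valid for `region`."""
    region: str
    subnet_id: str
    security_group_id: str
    image_id: str


class CapacityError(Exception):
    """Hypothetical stand-in for an InsufficientInstanceCapacity failure."""


def start_runner(
    targets: Iterable[LaunchTarget],
    launch: Callable[[LaunchTarget], str],
) -> Tuple[LaunchTarget, str]:
    """Try the targets in the given order; return the first that starts."""
    failures: List[str] = []
    for target in targets:
        try:
            # `launch` is assumed to create a region-specific client and start the instance.
            return target, launch(target)
        except CapacityError as err:
            failures.append(f"{target.region}/{target.subnet_id}: {err}")
    raise RuntimeError("all targets failed: " + "; ".join(failures))


def fake_launch(target: LaunchTarget) -> str:
    """Demo launcher: pretend that us-east-1 is out of capacity."""
    if target.region == "us-east-1":
        raise CapacityError("no capacity")
    return f"i-demo-{target.region}"
```

Returning the chosen target along with the instance ID matters because the later stop step has to talk to the same region the instance was started in.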