Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple regions/availabilty zones if an instance type cannot be started #60

Closed
gilShin opened this issue Aug 9, 2021 · 21 comments
Labels
question Further information is requested

Comments

@gilShin
Copy link

gilShin commented Aug 9, 2021

We are using instances types that require lots of reasources. Because of, from time to time it is impossible to start an instance of some type in a certain region and a certain AZ.
It'll be great if this plugin will be able to run over list of regions/AZs (subnets and security groups) and start the instance where its type is available.

@stas00
Copy link

stas00 commented Aug 27, 2021

I just came to file the same request, getting 95% of the time this:

Error: AWS EC2 instance starting error
Error: InsufficientInstanceCapacity: We currently do not have sufficient p3.8xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b.
Error: We currently do not have sufficient p3.8xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b.

and this tool fails if I remove aws-region from the config file.

Needing this for a CI, so this is a big issue.

Thank you!

@philschmid
Copy link

philschmid commented Aug 27, 2021

To support multi-AZ We could make config.input.subnetId a list and try to iterate randomly over it here and stop when an instance started.
I think multi-Region is hard to implement since Security groups, AMIs, etc are always bound to a Region.

@gilShin
Copy link
Author

gilShin commented Aug 27, 2021 via email

@jpalomaki
Copy link
Contributor

FWIW, when using EC2 launch templates, one can apparently omit the explicit selection of AZ/subnet.

It might be worth it to investigate/test if this facility would imply intelligent automatic selection of AZ (taking capacity into account).

@jpalomaki
Copy link
Contributor

jpalomaki commented Aug 30, 2021

Turns out one can only omit subnet ID from run parameters, if you are using the default VPC. Per related AWS docs:

[EC2-VPC] If you don't specify a subnet ID, we choose a default subnet from your default VPC for you. If you don't have a default VPC, you must specify a subnet ID in the request.

@gilShin
Copy link
Author

gilShin commented Aug 30, 2021 via email

@jpalomaki
Copy link
Contributor

jpalomaki commented Aug 30, 2021

If I do use the default vpc, will it try in one AZ or will keep on trying
in all AZs?

I don't know, but I suspect it would just pick one. What I'd like to know is whether the automatic subnet selection algorithm is AZ capacity-aware or not.

@gilShin
Copy link
Author

gilShin commented Aug 30, 2021 via email

@jpalomaki
Copy link
Contributor

jpalomaki commented Aug 30, 2021

All needed capacity. AWS must be checking them anyhow, at instance launch time, to be able to distribute/optimize load.

Looking at https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/ and then https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-capacity, the latter link does mention "not specifying AZ" as one solution. 🤷‍♂️

That said, one option could be to abstract have the runner start job in a composite action that tries to try ec2-github-runner start with different region/AZ settings (in consecutive, conditional steps) until it succeeds, returning the start outputs from the first successful step?

EDIT: amended workaround proposal because it looks like composite actions do not support if conditions

@gilShin
Copy link
Author

gilShin commented Aug 30, 2021 via email

@stas00
Copy link

stas00 commented Aug 31, 2021

And while we are at it, when it fails to start because of the region not having an instance I get a while bunch of errors on stopping:

snapshot_14

@jpalomaki
Copy link
Contributor

And while we are at it, when it fails to start because of the region not having an instance I get a while bunch of errors on stopping

That can probably be fixed by amending the if condition on the stop job (to also check the result of the start job)

@jpalomaki
Copy link
Contributor

jpalomaki commented Aug 31, 2021

@stas00 would something like this work? (NOTE: untested and some details omitted for brevity):

jobs:
  start-runner:
    runs-on: ubuntu-20.04
    outputs:
      label: ${{ steps.result.outputs.label }}
      ec2-instance-id: ${{ steps.result.outputs.ec2-instance-id }}
    steps:
      ...
      - id: try-us-east-2a
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          subnet-id: subnet-on-2a
          ...
      - id: try-us-east-2b
        if: steps.try-us-east-2a.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          subnet-id: subnet-on-2b
          ...
      - id: try-us-east-2c
        if: steps.try-us-east-2a.outcome == 'failure' && steps.try-us-east-2b.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          subnet-id: subnet-on-2c
          ...
      - id: result
        run: |
          if [ "${{ steps.try-us-east-2a.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-2a.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-2a.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-2b.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-2b.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-2b.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-2c.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-2c.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-2c.outputs.ec2-instance-id }}"
          fi

Also, if we supported launch templates (see #65), then multi-region would be less awkward as we could have a template per region (also less inputs per ec2-github-runner step)?

EDIT: fixed the if condition on the try-us-east-2c step to only run if both previous tries failed.

@stas00
Copy link

stas00 commented Aug 31, 2021

Thank you for writing out a concrete example, @jpalomaki! I will try and report back.

So it's the same sub-region we are using the same AMI /security group, but only rotating though the subnet-ids. Got it.

@jpalomaki
Copy link
Contributor

Thank you for writing out a concrete example, @jpalomaki! I will try and report back.

So it's the same sub-region we are using the same AMI /security group, but only rotating though the subnet-ids. Got it.

Np.

NOTE: I now fixed the if condition on the try-us-east-2c step so that it only runs if both previous steps failed. Also be sure to monitor how many instances/runners are actually spawned, to avoid excess costs.

@stas00
Copy link

stas00 commented Sep 1, 2021

So I have done quite a few experiments and your example works perfectly well, @jpalomaki - thank you so much.

I switched to us-east-1 since us-east-2 didn't even have 3 subnets that supported the instance type I needed. On us-east-1 I think there are only 3.

It'd be nice to be able to remove the repetition of aws-resource-tags and other sections if at all possible. If not, that's OK too.

The final config is here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/.github/workflows/main.yml#L17

@gilShin
Copy link
Author

gilShin commented Sep 1, 2021 via email

@stas00
Copy link

stas00 commented Sep 1, 2021

this particular setup is just sub-regions.

I'm concerned this might not be enough still at times. So different regions would be even more fault tolerant, but this is already a great start.

@gilShin
Copy link
Author

gilShin commented Sep 1, 2021 via email

@machulav
Copy link
Owner

machulav commented Sep 2, 2021

@stas00, great that you've made it to work!
@jpalomaki, thank you a lot for your help and examples! I really appreciate your help here!

I hope that the launch templates from #65 will make selecting the AZ from the region less painful in the future. Maybe it will be helpful also for the region selection when you have one launch template in each region and then try to create instances in each region one by one using the example from @jpalomaki.

But for now, I close the issue as we have the solution. Let's monitor the situation and re-open it if required.

@machulav machulav closed this as completed Sep 2, 2021
@machulav machulav added the question Further information is requested label Sep 2, 2021
mgaitan added a commit to Shiphero/ec2-github-runner that referenced this issue Oct 23, 2023
mgaitan added a commit to Shiphero/ec2-github-runner that referenced this issue Oct 23, 2023
mgaitan added a commit to Shiphero/ec2-github-runner that referenced this issue Oct 23, 2023
@mgaitan
Copy link

mgaitan commented Oct 23, 2023

Hey, I've implemented the idea of @philschmid

We could make config.input.subnetId a list and try to iterate randomly over it

Check this
Shiphero@69a9f36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

6 participants