
A speedrun on consumer grade cards? #29

Open
fzyzcjy opened this issue Nov 22, 2024 · 16 comments

Comments

@fzyzcjy

fzyzcjy commented Nov 22, 2024

Hi, thanks for the great repo! I would appreciate a speedrun on consumer-grade cards, e.g. the RTX 4090. Since the model is only 125M params, it should fit in the RTX 4090's 24GB of memory when trained the classical way, so it is trainable.

@KellerJordan
Owner

A suggestion: to reduce memory, you could run with a lower sequence length.

@fzyzcjy
Author

fzyzcjy commented Nov 25, 2024

I think so, thanks :) Just wondering whether there will be a speedrun like the current great one, but focused on RTX 4090 time, since many more people have consumer-grade cards than H100s.

@naoro

naoro commented Nov 25, 2024

I think Google Colab speedruns would also be awesome.
That would greatly commoditize research and experimentation.

@fzyzcjy
Author

fzyzcjy commented Nov 25, 2024

That looks interesting! (Though I guess it may be too hard to get it running in acceptable time...)

@alexjc

alexjc commented Nov 25, 2024

Realistically, a single-card speedrun would need a smaller model too; otherwise it's too slow to experiment with.

Thinking:

n_layer = 8
n_embd = 512

The sequence length during training has been a variable factor in the last speedrun; for evaluation, is it fine to use the whole document, clamped to a maximum window size of 1024?
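For a rough sense of scale, the suggested config can be sized with a back-of-envelope count (GPT-style transformer with tied embeddings; the GPT-2 vocab size and the 12·d² per-block rule are assumptions for illustration, not values from the repo):

```python
# Rough parameter count for the smaller config suggested above.
# Assumes a GPT-style block (attention ~4*d^2 + MLP ~8*d^2 params),
# tied input/output embeddings, and GPT-2's 50257-token vocab;
# biases and norm params are ignored.
n_layer, n_embd, vocab = 8, 512, 50257

embed = vocab * n_embd            # ~25.7M token-embedding params
per_block = 12 * n_embd ** 2      # ~3.1M params per transformer block
total = embed + n_layer * per_block
print(f"~{total / 1e6:.0f}M params")  # ~51M, vs ~125M for the baseline
```

So this config roughly halves the parameter count relative to the 125M baseline.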

@fzyzcjy
Author

fzyzcjy commented Nov 25, 2024

It seems the H100 has ~2000 TFLOPS for bf16 tensor cores, while the 4090 is about 330 TFLOPS. Thus 5 minutes on 8xH100 ≈ 4 hours on 1x4090, which doesn't look bad!

The major problem looks to be that the memory is only 24GB... so we may not be able to use some optimizations.
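The estimate above checks out as a quick calculation (peak bf16 numbers as quoted in the thread; real utilization will be lower on both sides):

```python
# Back-of-envelope throughput comparison using the peak bf16 tensor-core
# TFLOPS figures quoted in this thread (actual utilization will differ).
h100_tflops = 2000        # per H100
rtx4090_tflops = 330      # per RTX 4090

cluster = 8 * h100_tflops            # 8xH100 = 16,000 TFLOPS peak
ratio = cluster / rtx4090_tflops     # ~48.5x slower on one 4090

h100_run_minutes = 5
rtx4090_hours = h100_run_minutes * ratio / 60
print(f"~{rtx4090_hours:.1f} hours")  # ~4.0 hours
```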

@alexjc

alexjc commented Nov 25, 2024

4h feels quite high for a speedrun. Too hard to test ideas, no?

Working on some memory optimizations now, should help a lot...

@fzyzcjy
Author

fzyzcjy commented Nov 25, 2024

Faster would surely be great! But if that's impossible, then 4h is better than nothing :(

@naoro

naoro commented Nov 25, 2024

> It seems the H100 has 2000 TFLOPS for bf16 tensor cores, while the 4090 is about 330 TFLOPS. Thus 8xH100 5 minutes = 1x4090 4 hours, which looks not bad!
>
> The major problem looks like the memory is only 24GB... So we may not be able to do some optimizations.

The A100 has 40GB; cost is about $10 for ~12 hours, with about the same TFLOPS as the 4090.
So about $3.30 per run. Not bad, I'd think.

@fzyzcjy
Author

fzyzcjy commented Nov 25, 2024

If we optimize for cost, the 4090 is much cheaper than the A100 per hour while having the same TFLOPS. So as long as we manage to fit in 24GB, maybe we can scale the cost down further.

@KellerJordan
Owner

A note: the current cost per run on an 8xH100 node is about $1.90 (since it's about $3/hr for SXM H100s).

Personally, when I don't feel like spending that much, I go back to speedrunning CIFAR-10. But I understand that might not be as interesting to everyone.

@fzyzcjy
Author

fzyzcjy commented Nov 26, 2024

The 4090 looks to be about $0.3/hr, so 4hr = $1.2, which is a bit cheaper. Moreover, some people have 4090s at home (e.g. many people in r/LocalLlama, me, etc.), while far fewer people buy A100s/H100s for home use, and buying outright is much cheaper than renting from a cloud.

@lapp0

lapp0 commented Nov 26, 2024

I'm also interested in this variant.

Considering the long runtime, perhaps it makes sense to compete on minimizing validation loss within a 1-hour run?

@KellerJordan
Owner

KellerJordan commented Nov 26, 2024

I would guess that halving the sequence length (and going to batch size 16) would allow fitting the run into 24GB of memory without impacting performance very much; or quartering it, if that still doesn't fit.
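A minimal sketch of the trade-off being suggested, with hypothetical config names and an assumed baseline (these are not the repo's actual flags or values):

```python
# Hypothetical configs illustrating the suggestion: halve the sequence
# length and double the batch size, keeping tokens per micro-batch fixed.
baseline = {"sequence_length": 1024, "batch_size": 8}   # assumed baseline
halved   = {"sequence_length": 512,  "batch_size": 16}  # suggested tweak

def tokens_per_step(cfg):
    return cfg["sequence_length"] * cfg["batch_size"]

# Same tokens seen per step, so the data/optimization schedule is preserved.
assert tokens_per_step(baseline) == tokens_per_step(halved) == 8192

# Linear activation memory (~batch * seq) is unchanged, but the attention
# score matrix scales with batch * seq^2, so this halves that term.
def score_elems(cfg):
    return cfg["batch_size"] * cfg["sequence_length"] ** 2

print(score_elems(halved) / score_elems(baseline))  # 0.5
```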

@lapp0

lapp0 commented Nov 26, 2024

I achieved < 3.28 in a little under two hours with a few tweaks.

@KellerJordan are you interested in hosting a 1x4090 variant of the competition in this repo? If so, I'll submit a PR for `4090/train_gpt2_4090.py` and `4090/run_4090.sh` and update the readme.

@fzyzcjy
Author

fzyzcjy commented Nov 26, 2024

@lapp0 That looks great: $0.3/hr × 2hr = $0.6, which is about 3x cheaper than $1.90 (8xH100). Looking forward to your code!
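For reference, the cost arithmetic above as a quick sketch (rates as quoted earlier in the thread; market prices vary):

```python
# Cost comparison using the rental rates quoted in this thread.
rtx4090_rate = 0.30    # $/hr for a rented 4090 (quoted above)
run_hours = 2.0        # lapp0's reported wall-clock time
cost_4090 = rtx4090_rate * run_hours   # $0.60 per run
cost_8xh100 = 1.90                     # KellerJordan's per-run figure

print(f"${cost_4090:.2f} per run, "
      f"{cost_8xh100 / cost_4090:.1f}x cheaper")  # $0.60 per run, 3.2x cheaper
```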
