Using the Lilith optimizer on nanoGPT, messing with the LR and multiple schedulers
DeepSeek step-based scheduler implementation -> link (a rough sketch of the schedule is below)
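For reference, a minimal sketch of what I mean by the DeepSeek-style stepwise schedule, using a plain `LambdaLR` wrapper. The partitions are fractions of total iterations (so 8:1:1 -> 0.8/0.1/0.1), the 0.316/0.1 decay factors are my reading of the DeepSeek LLM recipe rather than something tuned here, and AdamW stands in for the optimizer because I'm not reproducing Lilith's constructor:

```python
import torch
import torch.nn as nn

def stepwise_lr_lambda(max_iters, partitions=(0.8, 0.1, 0.1), factors=(1.0, 0.316, 0.1)):
    """LambdaLR multiplier: hold the LR at peak * factors[i] inside partition i."""
    # Turn fractional partitions into absolute iteration boundaries.
    boundaries, total = [], 0.0
    for frac in partitions:
        total += frac
        boundaries.append(int(total * max_iters))

    def lr_lambda(it):
        for boundary, factor in zip(boundaries, factors):
            if it < boundary:
                return factor
        return factors[-1]

    return lr_lambda

model = nn.Linear(8, 8)  # stand-in for the nanoGPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=stepwise_lr_lambda(max_iters=5000)
)
# In the training loop, call scheduler.step() once per iteration.
```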
- Test 26, setting dropout to a low value (~0.01) is beneficial, but the loss descends almost linearly compared to the smooth AdamW curves (see the dropout config sketch at the end of this log)
- Test 25, DeepSeek scheduler with 2:4:4 and 8:1:1 partitions at acceleration=1000; also, these runs now finishing in roughly 1/4 the time is really useful!
- Test 24, graphs for acceleration=10, 50, and 1000: it boosts training a little early on, with slightly better curves and slightly lower loss, so I might just crank this value for a +1% boost. Here Lilith runs with acceleration and bs=48 come really close to an AdamW run with bs=180, and Lilith's training was almost 5x faster (70ms per step vs 300ms per step), so roughly 4x less batch memory while being ~5x faster
- Test 23, acceleration=4 matches acceleration=2, and yet these values look like they match larger batch sizes; this optimizer is fire
- Test 22, matched beta1_m to AdamW's beta1 and set beta_v near AdamW's beta2, also trying acceleration=2; in the graph we have an overfitting AdamW run vs Lilith with acceleration=2, which follows the same path with less overfitting (these hyperparameters are collected in a sketch at the end of this log)
- Test 21, bs=600. My setup OOMs at batch sizes beyond 600; Lilith is about 10% faster. Interestingly: faster and equal or better?!
- Test 20, AdamW can match Lilith at bs=180; testing bs=360 (yellow and orange)
- Test 19, scaling the batch size to 360 appears to have a similar effect so far, but better; this would explain euclaise's tests, where his bs=1024
- Test 18, trying batch size 180, lr 3e-4, cosine schedule: SOTA result by a margin, beats AdamW?! It shows the same behaviour as AdamW on large batches, but better? Could this be the large-scale training optimizer?
- Test 17, using the DeepSeek step-based scheduler again, first graph 2:4:4, second graph 8:1:1; 8:1:1 is a really successful scheduler and achieved the same val loss as cosine AdamW
- Brand new version: lost the graphs due to corruption, but the new good LR is 3e-4, from Test 16
- Test 15, trying the DeepSeek-based LR steps once again, 2:4:4 (first graph, lr 1e-4 due to numerical instability) and 8:1:1 (second graph, lr 8e-5). The first step change in 2:4:4 worked but it flatlined afterwards, so some progress on that end, while the run with the DeepSeek partition values (8:1:1) was much, much better, almost matching cosine
- Test 14, set beta1 and beta2 to 0.95 and 0.98: slightly worse; a trial of 0.98 and 0.999999 was even worse, but good tuning might still give a +1% boost
- Test 13, lr 8e-5 (initially 5e-5, but that was too low and barely moved the loss); 8e-5 appears to be an even better initial sweet spot than 1e-4, though it starts converging
- Test 12, same as Test 9 but testing batch size and lowering iterations for efficiency; the loss sits slightly above the SOTA run, but that's expected from larger batches. It trains on 1.2x more tokens than before in 1/3 the time, so Lilith scales with batch size just like AdamW
- Test 11, changed ema_k from 0 to 1 for better numerical stability, using a cosine LR schedule, lr=1e-3
- Note: it is numerically stable, no NaNs, but the loss is very volatile, literally unlearning
- Test 10, using a triangular LR schedule: it literally doesn't want to work, just like the previous TLR spike; going to stick with multistep or cosine
- Test 9, the orange run being the new Lilith, lr=1e-4, cosine scheduler (sketched at the end of this log); it literally matches AdamW for a while before flattening out earlier, but the val losses match at ~1.47, so maybe it's just not as prone to overfitting?
- Test 1, Lilith with default params and cosine LR; AdamW with Karpathy's params and cosine LR
- Test 2, Lilith with slight LR changes (lr 1e-2) and TLR; AdamW with Karpathy's params and cosine LR
- Test 3, Lilith lr 3e-4 with cosine LR; AdamW the same as before
- Test 4, current Lilith in blue, lr 1e-4, cosine LR
- Test 5, current Lilith in green, lr 5e-5, cosine LR: too low, and the model can't seem to get as low as AdamW
- Further tests: try to reintroduce TLR, then try a DeepSeek-style stepwise LR
- Test 6, TLR reintroduction (pink) vs SOTA Lilith (blue) and AdamW (red), lr 1e-4: didn't go well, TLR is too unstable (a stock-PyTorch triangular schedule is sketched at the end of this log); will try the DeepSeek step-based LR later
- Test 7, using the DeepSeek-based LR in yellow, lr 1e-4, 20%/40%/40% partitions: didn't do anything, but that may just be my unfamiliarity with the step-based version
- Test 8, using the same step partitions as in the DeepSeek paper, teal line, lr 1e-4, 80%/10%/10% partitions. I need to fix it: the LR freaks out and goes to zero, but this optimizer does not seem to like the scheduler either, with literally no change/drop in loss in any case
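For my own reference, the settings that worked best above, collected in one place. This is a hypothetical sketch, not Lilith's real constructor: the import path and the keyword names (beta1_m, beta_v, ema_k, acceleration) are just the knobs named in the notes above, so check the actual repo for the real signature and defaults.

```python
import torch.nn as nn
from lilith import Lilith  # assumed import path, not verified against the repo

model = nn.Linear(8, 8)  # stand-in for the nanoGPT model

# Hypothetical call: kwarg names mirror the knobs discussed in the log and the
# real parameter names/defaults may differ.
optimizer = Lilith(
    model.parameters(),
    lr=3e-4,         # the "new good lr" (Tests 16/18)
    beta1_m=0.9,     # matched to AdamW's beta1 (Test 22)
    beta_v=0.95,     # kept near AdamW's beta2 (Test 22)
    ema_k=1,         # 0 -> 1 for numerical stability (Test 11)
    acceleration=2,  # 2-4 behaves like a much larger batch (Tests 22-24)
)
```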
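What I mean by the cosine schedule throughout: linear warmup into a cosine decay down to a floor, along the lines of nanoGPT's get_lr(). A minimal sketch; the warmup length, decay horizon, and floor here are placeholders, not the exact values from these runs.

```python
import math

learning_rate = 3e-4         # peak LR (placeholder)
min_lr = learning_rate / 10  # floor the decay bottoms out at (placeholder)
warmup_iters = 100           # linear warmup length (placeholder)
lr_decay_iters = 5000        # iteration where decay reaches min_lr (placeholder)

def get_lr(it):
    # 1) linear warmup from 0 up to the peak LR
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    # 2) past the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from peak down to the floor in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

# Each iteration: for group in optimizer.param_groups: group["lr"] = get_lr(iter_num)
```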
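And one way to get the triangular (TLR) schedule from Tests 2/6/10 with stock PyTorch, in case I revisit it; the base/max LRs and cycle length here are placeholders, not the values from those runs.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the nanoGPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-5,          # bottom of the triangle (placeholder)
    max_lr=1e-4,           # peak of the triangle (placeholder)
    step_size_up=250,      # iterations from base_lr up to max_lr (placeholder)
    mode="triangular",
    cycle_momentum=False,  # AdamW param groups have no "momentum" key
)
# In the training loop, call scheduler.step() once per iteration.
```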
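Finally, where the dropout value from Test 26 actually gets set: nanoGPT exposes it on the model config. Field names are as in Karpathy's model.py as I remember them, and the small model dimensions are just placeholders.

```python
from model import GPTConfig, GPT  # nanoGPT's model.py

config = GPTConfig(
    block_size=256,
    vocab_size=50304,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.01,  # the low-but-nonzero value from Test 26
    bias=False,
)
model = GPT(config)
```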