2021.10.29 v0.1

change mixins from ModuleList to ModuleDict
return tokens and mems in fill_sequence, and mems becomes a tensor.
CachedAutoRegressiveMixin

How to migrate old SAT ckpt to new version?

Example:

import torch
old = torch.load('xxxxx/mp_rank_00_model_states.pt.old', map_location='cpu')

# replace names, mixins index to keys
oldm = old['module']
for k in list(oldm.keys()):
    if k.startswith('mixins.0'):
        new_k = k.replace('mixins.0', 'mixins.extra_position_embedding')
    elif k.startswith('mixins.1'):
        new_k = k.replace('mixins.1', 'mixins.attention_plus')
    else:
        continue
    oldm[new_k] = oldm[k]
    del oldm[k]
# save to destination    
torch.save(old, 'xxxxx/mp_rank_00_model_states.pt')

for the older framework, you also need:

old['module']['transformer.word_embeddings.weight'] = old['module']['word_embeddings.weight']
del old['module']['word_embeddings.weight']

2021.11.5 v0.1.2

Add generation.autoregressive_sampling.evalute_perplexity
fix Runtime Error in skipping Nan Loss

2021.12.13 v0.1.4

Add non_conflict attention_fn
Add Prefix-Tuning
Now, you can use kw_args['output_this_layer'] (any hooks in the transformer layers) to return values to final outputs and kw_args['output_cross_layer'] to pass values to kw_args in the next layer.

Examples:

def attention_fn(...some_args):
    ...
    kw_args['output_this_layer']['mem_kv'] = cache_kv
    ...

This will let the key 'mem_kv' appear in the outputs_per_layers[i] of logits, *outputs_per_layers = model(...).

def attention_fn(...some_args, **kw_args):
    ...
    kw_args['output_cross_layer']['last_attention_map'] = attention_map
    ...

This will let the key 'last_attention_map' appear in the next layer's kw_args (all hooks).

2021.12.13 v0.1.7

Ensure enough training data, no longer always 200 times
You can use kw_args['cross_layer_output']['new_key']=xxx to pass other results to each layer in position/word_embedding_forward.
Add --train-data-weights.

2022.1.13 v0.1.9

Add Vit
Fix evaluation all_reduce bug

2022.6.3 v2.0

split all the default hooks out
change the order, model hooks will not override all the things. They now are the same as mixin hooks added in the front of all the mixins.

2022.6.6 v2.0

from_pretrained now auto downloads models. There are two kinds of usages: SomeModel.from_pretrained(args, name) will load the weights of name model to a SomeModel with the same model arch hyper-params with name; AutoModel.from_pretrained(args, name) will return an official model (model_class Class) with the pretrained weights.
ENV SAT_HOME is where we put the models in. Set it in your shell file.
don't necessarily need deepspeed_config, or pass model arch hyper-params for from_pretrained. Use zero-stage 0/1/2.

2022.6.27

Fix *flat_output bug.
fix defualt mpu init_method bug.

2023.4.9

Large update v.0.3.0

delete --sandwich-ln
from_pretrained(args, name) => from_pretrained(name, args=None)
MODEL_URLS fix typo
enable model-only mode

2023.4.11

v.0.3.1 refactor SwissArmyTransformer as sat (package name SwissArmyTransformer)

2023.4.21

v 0.3.2 fix model-only "create then inference" bug support deepspeed 0.8.x & 0.9.x model register first try

2023.4.23

v 0.3.3 change the fp16 & to cuda order in get_model.

2023.5.15

v 0.3.4

add example for nested transformer models
move all print to logging, set SAT_LOGLEVEL to control

2023.5.16

v. 0.3.5

add repetition penalty
add quantization

2023.5.17

v. 0.3.6 support no deepspeed model-only test cpu inference test windows

2023.6.1

v. 0.3.7 update vit add qlora/lora2

2023.7.3

v. 0.4.0

add xfomers memory efficient attention.
pytorch 2.0 auto fast attention, attention_fn dispatch via version.
add llama and chatglm2.
add split model for model-parallel in inference mode.
add r2 download

2023.7.13

v. 0.4.1

better model parallel support (training mode split)
better default zero 1/2 config
test bf16 training
change qkv order of chatglm1
only use pytorch 2.0 attention when full / causal.

2023.9.10

v. 0.4.6

add droppath and checkpoint last layer skip
support multiple webdataset weighting
fix lora merging
add different lr in different parts, add a 'lr' attr for parameters in the disable_untrainable_params.

2024.1.11

v. 0.4.10

fix model parallel init possible bug by additional broadcast
add nsys profiling
add gated mlp option
support batch_from_same_dataset for multi-webds
fix cmp kernel quant no bias bug

2024.1.18

v. 0.4.11

fix the tarfile buffer_size bug in 0.4.9 and 0.4.10.
fix potential problem to pass a mixed-device model to training_main
fix emaadam no use error introduced in 0.4.9 and 0.4.10.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHANGE_LOG.md

CHANGE_LOG.md

2021.10.29 v0.1

How to migrate old SAT ckpt to new version?

2021.11.5 v0.1.2

2021.12.13 v0.1.4

2021.12.13 v0.1.7

2022.1.13 v0.1.9

2022.6.3 v2.0

2022.6.6 v2.0

2022.6.27

2023.4.9

2023.4.11

2023.4.21

2023.4.23

2023.5.15

2023.5.16

2023.5.17

2023.6.1

2023.7.3

2023.7.13

2023.9.10

2024.1.11

2024.1.18

Files

CHANGE_LOG.md

Latest commit

History

CHANGE_LOG.md

File metadata and controls

2021.10.29 v0.1

How to migrate old SAT ckpt to new version?

2021.11.5 v0.1.2

2021.12.13 v0.1.4

2021.12.13 v0.1.7

2022.1.13 v0.1.9

2022.6.3 v2.0

2022.6.6 v2.0

2022.6.27

2023.4.9

2023.4.11

2023.4.21

2023.4.23

2023.5.15

2023.5.16

2023.5.17

2023.6.1

2023.7.3

2023.7.13

2023.9.10

2024.1.11

2024.1.18