[Bug] Missing _is_using_mup when resume checkpoint #198

Open
xrsrke opened this issue Jun 14, 2024 · 1 comment
Labels
bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@xrsrke
Member

xrsrke commented Jun 14, 2024

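Resuming from a checkpoint fails when the checkpointed model doesn't use mup. Loading the checkpoint's config.yaml in run_generate.py raises an AttributeError: model_config is still a plain dict when __post_init__ tries to set _is_using_mup on it.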

Traceback (most recent call last):
  File "/fsx/phuc/projects/reference/nanotron/run_generate.py", line 255, in <module>
    main()
  File "/fsx/phuc/projects/reference/nanotron/run_generate.py", line 71, in main
    config = get_config_from_file((args.ckpt_path / "config.yaml").as_posix())
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 430, in get_config_from_file
    config = get_config_from_dict(
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 391, in get_config_from_dict
    return from_dict(
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 64, in from_dict
    value = _build_value(type_=field_type, data=field_data, config=config)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 99, in _build_value
    data = from_dict(data_class=type_, data=data, config=config)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 81, in from_dict
    instance = data_class(**init_values)
  File "<string>", line 8, in __init__
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 199, in __post_init__
    self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)
AttributeError: 'dict' object has no attribute '_is_using_mup'
[2024-06-14 03:38:04,551] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2747571) of binary: /fsx/phuc/projects/reference/env/bin/python
Traceback (most recent call last):
  File "/fsx/phuc/projects/reference/env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_generate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-14_03:38:04
  host      : ip-26-0-160-103.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2747571)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
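
For anyone picking this up, here is a minimal sketch of what the traceback suggests is happening, using only the standard library. It is not nanotron's actual code: `ToyModelConfig`, `ToyModelArgs`, `GuardedModelArgs`, `RandomInit` and `reproduce` are hypothetical names, and the `SpectralMupInit` class here is a stand-in for the real one. The sketch assumes the config loader hands `__post_init__` a plain dict instead of a config object, which is what the `'dict' object has no attribute` message points to. The guarded variant shows one possible workaround (coercing the dict before setting the flag); whether that is the right fix for nanotron is an open question.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class RandomInit:
    """Hypothetical stand-in for a non-mup init method."""
    std: float = 0.02


@dataclass
class SpectralMupInit:
    """Stand-in for the class named in the traceback (not the real one)."""
    use_mup: bool = True


@dataclass
class ToyModelConfig:
    """Hypothetical model config with a single field."""
    hidden_size: int = 16


@dataclass
class ToyModelArgs:
    model_config: Any
    init_method: Union[RandomInit, SpectralMupInit]

    def __post_init__(self):
        # Same pattern as the failing line in the traceback: this only
        # works if model_config is an object, not a dict.
        self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)


@dataclass
class GuardedModelArgs:
    model_config: Any
    init_method: Union[RandomInit, SpectralMupInit]

    def __post_init__(self):
        # Assumed workaround: build the config object first if a raw dict
        # slipped through, then set the private flag on it.
        if isinstance(self.model_config, dict):
            self.model_config = ToyModelConfig(**self.model_config)
        self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)


def reproduce() -> str:
    """Trigger the same AttributeError as in the report and return its message."""
    try:
        ToyModelArgs(model_config={"hidden_size": 16}, init_method=RandomInit())
    except AttributeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
    fixed = GuardedModelArgs(model_config={"hidden_size": 16}, init_method=RandomInit())
    print(type(fixed.model_config).__name__, fixed.model_config._is_using_mup)
```

The guard only addresses the symptom. If the dict shows up because the loader could not match config.yaml to any known config class, the underlying mismatch would still need to be tracked down.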
xrsrke added the bug, good first issue, and help wanted labels Jun 14, 2024
@bpopeters

I am also encountering this problem when I attempt to run run_generate.py.
