
[BUG] dp change-bias will give much larger model #4348

Open
QuantumMisaka opened this issue Nov 13, 2024 · 5 comments
@QuantumMisaka

Bug summary

Starting from a pre-trained multi-head model, dp --pt change-bias produces a model file that is much larger. However, fine-tuning with numb_steps: 0 has no such problem:

(base) [2201110432@wm2-login01 fine2]$ ll -h
total 465M
-rw-rw-r-- 1 2201110432 2201110432   24 Nov 13 15:46 checkpoint
lrwxrwxrwx 1 2201110432 2201110432   27 Nov 13 15:36 dpa230m.pt -> DPA2_medium_28_10M_beta4.pt
-rw-rw-r-- 1 2201110432 2201110432 338M Nov 13 15:45 dpa230m_updated.pt
-rw-rw-r-- 1 2201110432 2201110432  800 Nov 13 15:46 dpa2.hdf5
-rw-rw-r-- 1 2201110432 2201110432 119M Nov 13 15:35 DPA2_medium_28_10M_beta4.pt
-rw-rw-r-- 1 2201110432 2201110432 108K Nov 13 15:46 dpfine_4279321.err
-rw-rw-r-- 1 2201110432 2201110432    0 Nov 13 15:43 dpfine_4279321.out
-rw-r--r-- 1 2201110432 2201110432  692 Nov 13 15:43 fine.slurm
-rw-rw-r-- 1 2201110432 2201110432 2.4K Nov 13 15:36 input.json
-rw-rw-r-- 1 2201110432 2201110432 3.0K Nov 13 15:45 input_v2_compat.json
-rw-rw-r-- 1 2201110432 2201110432    0 Nov 13 15:46 lcurve.out
-rw-rw-r-- 1 2201110432 2201110432 7.9M Nov 13 15:46 model_finetune.ckpt-0.pt
lrwxrwxrwx 1 2201110432 2201110432   24 Nov 13 15:46 model_finetune.ckpt.pt -> model_finetune.ckpt-0.pt
-rw-rw-r-- 1 2201110432 2201110432 4.8K Nov 13 15:45 out.json

The model after change-bias, dpa230m_updated.pt, is much larger than even the original model, while the 0-step fine-tuned model model_finetune.ckpt-0.pt has the much smaller size that is desired.

Also, when trying to load the model after change-bias, a head must be selected, which is likewise not desired (a workaround sketch is shown after the snippets below):

In [1]: from deepmd.infer.deep_pot import DeepPot

In [2]: model = DeepPot("dpa230m_updated.pt")
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
/data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:110: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_file, map_location=env.DEVICE)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 model = DeepPot("dpa230m_updated.pt")

File /data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/infer/deep_eval.py:334, in DeepEval.__init__(self, model_file, auto_batch_size, neighbor_list, *args, **kwargs)
    326 def __init__(
    327     self,
    328     model_file: str,
   (...)
    332     **kwargs: Any,
    333 ) -> None:
--> 334     self.deep_eval = DeepEvalBackend(
    335         model_file,
    336         self.output_def,
    337         *args,
    338         auto_batch_size=auto_batch_size,
    339         neighbor_list=neighbor_list,
    340         **kwargs,
    341     )
    342     if self.deep_eval.get_has_spin() and hasattr(self, "output_def_mag"):
    343         self.deep_eval.output_def = self.output_def_mag

File /data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:121, in DeepEval.__init__(self, model_file, output_def, auto_batch_size, neighbor_list, head, *args, **kwargs)
    118 if isinstance(head, int):
    119     head = model_keys[0]
    120 assert (
--> 121     head is not None
    122 ), f"Head must be set for multitask model! Available heads are: {model_keys}"
    123 assert (
    124     head in model_keys
    125 ), f"No head named {head} in model! Available heads are: {model_keys}"
    126 self.input_param = self.input_param["model_dict"][head]

AssertionError: Head must be set for multitask model! Available heads are: ['Domains_Alloy', 'Domains_Anode', 'Domains_Cluster', 'Domains_Drug', 'Domains_FerroEle', 'Domains_OC2M', 'Domains_SSE-PBE', 'Domains_SemiCond', 'H2O_H2O-PD', 'Metals_AgAu-PBE', 'Metals_AlMgCu', 'Metals_Cu', 'Metals_Sn', 'Metals_Ti', 'Metals_V', 'Metals_W', 'Others_C12H26', 'Others_HfO2', 'Domains_ANI', 'Domains_SSE-PBESol', 'Domains_Transition1x', 'H2O_H2O-DPLR', 'H2O_H2O-PBE0TS-MD', 'H2O_H2O-PBE0TS', 'H2O_H2O-SCAN0', 'Metals_AgAu-PBED3', 'Others_In2Se3', 'MP_traj_v024_alldata_mixu']

whereas the 0-step fine-tuned model loads without problems:

In [3]: model = DeepPot("model_finetune.ckpt-0.pt")
/data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:110: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_file, map_location=env.DEVICE)
You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.
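
For reference, the multi-head checkpoint can still be loaded by selecting a branch explicitly. The sketch below is based only on the head keyword visible in the DeepEval signature in the traceback above; it is an assumption, not a verified workaround:

from deepmd.infer.deep_pot import DeepPot

# Assumed workaround: pass the branch name through to the PyTorch backend,
# which the traceback shows accepts a `head` argument for multitask models.
model = DeepPot("dpa230m_updated.pt", head="Domains_OC2M")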

DeePMD-kit Version

v3.0.0b4

Backend and its version

pytorch 2.5.1

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Command for change-bias:

dp --pt change-bias dpa230m.pt -s ../../data-clean4_radsp/train --model-branch Domains_OC2M

Command for the 0-step fine-tune:

dp --pt train input.json --finetune dpa230m.pt --model-branch Domains_OC2M

Corresponding input.json:

{
  "_comment": "that's all",
  "model": {
    "type_map": [
      "C",
      "Fe",
      "H",
      "O"
    ],
    "descriptor": {
      "type": "dpa2",
      "repinit": {
        "tebd_dim": 8,
        "rcut": 6.0,
        "rcut_smth": 0.5,
        "nsel": 120,
        "neuron": [
          25,
          50,
          100
        ],
        "axis_neuron": 12,
        "activation_function": "tanh",
        "three_body_sel": 40,
        "three_body_rcut": 4.0,
        "three_body_rcut_smth": 3.5,
        "use_three_body": true
      },
      "repformer": {
        "rcut": 4.0,
        "rcut_smth": 3.5,
        "nsel": 40,
        "nlayers": 6,
        "g1_dim": 128,
        "g2_dim": 32,
        "attn2_hidden": 32,
        "attn2_nhead": 4,
        "attn1_hidden": 128,
        "attn1_nhead": 4,
        "axis_neuron": 4,
        "update_h2": false,
        "update_g1_has_conv": true,
        "update_g1_has_grrg": true,
        "update_g1_has_drrd": true,
        "update_g1_has_attn": false,
        "update_g2_has_g1g1": false,
        "update_g2_has_attn": true,
        "update_style": "res_residual",
        "update_residual": 0.01,
        "update_residual_init": "norm",
        "attn2_has_gate": true,
        "use_sqrt_nnei": true,
        "g1_out_conv": true,
        "g1_out_mlp": true
      },
      "add_tebd_to_repinit_out": false
    },
    "fitting_net": {
      "neuron": [
        240,
        240,
        240
      ],
      "resnet_dt": true,
      "seed": 19090,
      "_comment": " that's all"
    },
    "_comment": " that's all"
  },
  "learning_rate": {
    "type": "exp",
    "decay_steps": 2000,
    "start_lr": 0.001,
    "stop_lr": 3.51e-08,
    "_comment": "that's all"
  },
  "loss": {
    "type": "ener",
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0,
    "limit_pref_v": 0,
    "_comment": " that's all"
  },
  "training": {
    "stat_file": "./dpa2.hdf5",
    "training_data": {
      "systems": "../../data-clean4_radsp/train/",
      "batch_size": "auto",
      "_comment": "that's all"
    },
    "numb_steps": 0,
    "warmup_steps": 0,
    "gradient_max_norm": 5.0,
    "max_ckpt_keep":20,
    "seed": 19090,
    "save_ckpt": "model_finetune.ckpt",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "save_freq": 20000,
    "_comment": "that's all"
  }
}

Steps to Reproduce

Run these commands on any dataset.

Further Information, Files, and Links

No response

@QuantumMisaka QuantumMisaka changed the title [BUG] change-bias [BUG] dp change-bias command have abnormal result Nov 13, 2024
@iProzd
Copy link
Collaborator

iProzd commented Nov 13, 2024

Fine-tuning with numb_steps: 0 saves only a single-head model, while change-bias keeps the multi-head model; this is expected.

@QuantumMisaka
Copy link
Author

Fine-tuning with numb_steps: 0 saves only a single-head model, while change-bias keeps the multi-head model; this is expected.

Thanks! However, the oversized change-bias model is still a problem.

@QuantumMisaka QuantumMisaka changed the title [BUG] dp change-bias command have abnormal result [BUG] dp change-bias will give much large model Nov 13, 2024
@njzjz
Copy link
Member

njzjz commented Nov 13, 2024

@QuantumMisaka could you post all the keys in the checkpoint?

import torch

def get_all_keys(d, prefix=""):
    """Gets all keys from a nested dictionary with slash-separated paths."""
    keys = []
    for k, v in d.items():
        if isinstance(v, dict):
            keys.extend(get_all_keys(v, prefix + str(k) + "/"))
        else:
            keys.append(prefix + str(k))
    return keys

print(get_all_keys(torch.load("dpa230m.pt")))
print(get_all_keys(torch.load("dpa230m_updated.pt")))

@QuantumMisaka
Copy link
Author

@njzjz They print the same results:

(base) [2201110432@wm2-data01 fine2]$ diff allkeys_base.txt allkeys_cbias.txt 
(base) [2201110432@wm2-data01 fine2]$

allkeys_base.txt
allkeys_cbias.txt
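
Since the key names are identical, one way to dig further is to count distinct tensor storages rather than keys: if the updated checkpoint holds more distinct storages for the same tensors, the extra size comes from duplicated data rather than new parameters. A minimal diagnostic sketch, assuming both files load on CPU:

import torch

def distinct_storage_bytes(obj):
    """Count distinct tensor storages and their total size in a nested checkpoint dict."""
    seen = {}
    def walk(x):
        if isinstance(x, dict):
            for v in x.values():
                walk(v)
        elif isinstance(x, torch.Tensor):
            s = x.untyped_storage()
            seen[s.data_ptr()] = s.nbytes()
    walk(obj)
    return len(seen), sum(seen.values())

print(distinct_storage_bytes(torch.load("dpa230m.pt", map_location="cpu")))
print(distinct_storage_bytes(torch.load("dpa230m_updated.pt", map_location="cpu")))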

@njzjz
Copy link
Member

njzjz commented Nov 14, 2024

The reason is probably the abuse of deepcopy:

model_state_dict = copy.deepcopy(old_state_dict.get("model", old_state_dict))

(or the copy of tensors that happens in other places)
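
As a toy illustration of that hypothesis (not the actual change_bias code path): when part of a state dict is deep-copied and then combined with tensors kept from the original dict, aliasing between the two halves is broken, so torch.save serializes the shared storage more than once and the file grows even though the keys and shapes stay the same.

import copy
import io

import torch

def saved_size(obj):
    """Size in bytes of obj serialized with torch.save."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.tell()

# Toy checkpoint: several entries alias the same underlying tensor,
# mimicking parameters shared between branches of a multi-head model.
shared = torch.randn(1024, 1024)
state = {"model": {"branch_a/w": shared, "branch_b/w": shared}, "extra": shared}
print(saved_size(state))    # storage written once, ~4 MB

# Deep-copying only the "model" sub-dict breaks aliasing with "extra",
# so the same data is serialized twice and the file roughly doubles.
rebuilt = {"model": copy.deepcopy(state["model"]), "extra": state["extra"]}
print(saved_size(rebuilt))  # ~8 MB, with identical keys and shapes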
