Program failed to train , I am using one GPU to run the program #21

Open
SMKamrulHasan opened this issue Oct 25, 2018 · 7 comments

@SMKamrulHasan

num train = 0, num_val = 0
Traceback (most recent call last):
  File "train.py", line 157, in <module>
    main()
  File "train.py", line 152, in main
    num_classes=num_classes
  File "/content/drive/My Drive/surgery/data/utils.py", line 56, in train
    model.load_state_dict(state['model'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
    Missing key(s) in state_dict: "module.encoder.0.weight", "module.encoder.0.bias", ...
...................................................

@SMKamrulHasan SMKamrulHasan changed the title Program runs failed, I am using one GPU to run the program Program failed to train , I am using one GPU to run the program Oct 25, 2018
@ternaus
Owner

ternaus commented Oct 25, 2018

First of all
num train = 0, num_val = 0

looks strange. Are you sure that your DataLoader defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py is correct?

@ternaus
Owner

ternaus commented Oct 25, 2018

Second
model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.

Can you delete it and try again?
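
For readers who hit the same "Missing key(s)" message: parameter names under torch.nn.DataParallel carry a "module." prefix, so a checkpoint saved from an unwrapped model (or from a different architecture) will not match. The following is a minimal sketch, not code from this repository, for comparing and adjusting key names before calling load_state_dict. The helper names are hypothetical, and it operates on plain dictionaries so it runs without PyTorch.

```python
from collections import OrderedDict

# Prefix that torch.nn.DataParallel adds to every parameter name.
PREFIX = "module."


def add_module_prefix(state_dict):
    """Return a copy of state_dict whose keys all start with 'module.'."""
    return OrderedDict(
        (key if key.startswith(PREFIX) else PREFIX + key, value)
        for key, value in state_dict.items()
    )


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading 'module.' removed from keys."""
    return OrderedDict(
        (key[len(PREFIX):] if key.startswith(PREFIX) else key, value)
        for key, value in state_dict.items()
    )


def diff_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)
```

With a real model this could be used as diff_keys(model.state_dict().keys(), state['model'].keys()). If the two lists differ only by the prefix, adjusting the keys may be enough; if they differ in other ways, the checkpoint probably belongs to another model, and deleting the stale runs/debug folder as suggested above is the simpler fix.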

@SMKamrulHasan
Author

> Second
> model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.
>
> Can you delete it and try again?

Yes, I deleted the "runs/debug" folder and tried again. That resolved the "RuntimeError: Error(s) in loading state_dict for DataParallel" problem, but I still get "num train = 0, num_val = 0".

python prepare_train_val.py
python train.py --device-ids 0 --batch-size 16 --fold $3 --workers 12 --lr 0.00001 --n-epochs 20 --type binary --jaccard-weight 1 --model UNet16

Log:
num train = 0, num_val = 0
Epoch 1, lr 1e-05: : 0it [00:00, ?it/s]
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Valid loss: nan, jaccard: nan
Epoch 2, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan
Epoch 3, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan
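
The nan values in this log are consistent with averaging an empty list of per-batch losses, which is what NumPy's "Mean of empty slice" warning reports. A possible guard, sketched below with hypothetical helper names that are not part of the repository, is to stop before the first epoch when either split is empty instead of training on nothing.

```python
def require_non_empty(name, items):
    """Raise a clear error if a dataset split has no samples; return its size."""
    n = len(items)
    if n == 0:
        raise ValueError(
            f"{name} split is empty; check the data paths before training"
        )
    return n


def mean_or_none(values):
    """Average a sequence of numbers, returning None instead of nan when empty."""
    values = list(values)
    return sum(values) / len(values) if values else None
```

Calling require_non_empty("train", train_file_names) right after the file lists are built would turn the silent "num train = 0" into an immediate error (train_file_names stands for whatever list the training script builds).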

@SMKamrulHasan
Author

> First of all
> num train = 0, num_val = 0
>
> looks strange. Are you sure that your DataLoader defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py is correct?

My folder layout is:
surgery/data/models/
surgery/data/train/instrument_dataset_1
surgery/data/test/instrument_dataset_1
surgery/data/cropped_train/instrument_dataset_1
surgery/data/train.py
surgery/data/model.py
surgery/data/prepare_data.py
surgery/data/prepare_train_val.py
surgery/data/dataset.py
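
A count of zero usually means the file search matched nothing, which can happen when the scripts are run from a directory other than the one the paths are relative to. The sketch below is an assumption-laden diagnostic, not repository code: the glob patterns and function names are hypothetical, and the actual patterns should be taken from prepare_train_val.py. It prints how many files each pattern matches under a given root.

```python
from pathlib import Path


def count_files(root, pattern):
    """Count regular files under root that match a glob pattern."""
    return sum(1 for path in Path(root).glob(pattern) if path.is_file())


def report(root, patterns):
    """Print and return the number of files matched by each pattern under root."""
    root = Path(root).resolve()
    counts = {}
    for pattern in patterns:
        counts[pattern] = count_files(root, pattern)
        print(f"{root / pattern}: {counts[pattern]} file(s)")
    return counts
```

For example, report(".", ["cropped_train/instrument_dataset_*/**/*"]) run from the directory that contains train.py shows whether any files are visible from there. If it prints 0, the working directory or the folder layout does not match what the scripts expect.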

@kimdinhthaibk

Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

@zapaishchykova

So for anyone encountering this error - check if you changed the problem type:
model = get_model(model_path, model_type='UNet11', problem_type='instruments')

@Di1113

Di1113 commented Jul 30, 2019

> Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

You might find this link useful: #3 (comment)
