Program failed to train, I am using one GPU to run the program #21

SMKamrulHasan · 2018-10-25T01:18:35Z

num train = 0, num_val = 0
Traceback (most recent call last):
  File "train.py", line 157, in <module>
    main()
  File "train.py", line 152, in main
    num_classes=num_classes
  File "/content/drive/My Drive/surgery/data/utils.py", line 56, in train
    model.load_state_dict(state['model'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
    Missing key(s) in state_dict: "module.encoder.0.weight", "module.encoder.0.bias", ...
...................................................

ternaus · 2018-10-25T01:38:34Z

First of all, num train = 0, num_val = 0 looks strange. Are you sure that your DataLoader, defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py, is correct?

ternaus · 2018-10-25T01:39:58Z

Second, model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.

Can you delete it and try again?
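For readers who hit the same "Missing key(s)" error and want to keep the checkpoint: the missing names all start with "module.", which is the prefix torch.nn.DataParallel adds to parameter names. One possible cause, not confirmed in this thread, is a checkpoint saved from an unwrapped model. The sketch below renames keys on a plain dict so it runs without PyTorch; the helper name match_dataparallel_prefix is hypothetical, and it will not help if the checkpoint comes from a different architecture.

```python
from collections import OrderedDict

PREFIX = "module."  # prefix that torch.nn.DataParallel adds to parameter names


def match_dataparallel_prefix(state_dict, wrapped):
    """Return a copy of state_dict whose keys suit the target model.

    wrapped=True  -> every key starts with "module." (target is DataParallel)
    wrapped=False -> no key starts with "module." (target is a bare model)
    """
    fixed = OrderedDict()
    for key, value in state_dict.items():
        # Normalise to the bare name first, then add the prefix back if needed.
        bare = key[len(PREFIX):] if key.startswith(PREFIX) else key
        fixed[PREFIX + bare if wrapped else bare] = value
    return fixed


# Toy checkpoint saved from an unwrapped model (integers stand in for tensors).
saved = {"encoder.0.weight": 1, "encoder.0.bias": 2}
print(list(match_dataparallel_prefix(saved, wrapped=True)))
# ['module.encoder.0.weight', 'module.encoder.0.bias']
```

With PyTorch, the result could be passed on as model.load_state_dict(match_dataparallel_prefix(state['model'], wrapped=True)). Deleting runs/debug, as suggested above, remains the simpler fix when the old checkpoint is not needed.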

SMKamrulHasan · 2018-10-25T01:58:42Z

> Second, model.load_state_dict(state['model']) is trying to load a saved model, which happens when your runs/debug folder is not empty.
>
> Can you delete it and try again?

Yes, I deleted the "runs/debug" folder and tried again. That solved the "RuntimeError: Error(s) in loading state_dict for DataParallel" problem, but I still get "num train = 0, num_val = 0".

python prepare_train_val.py
python train.py --device-ids 0 --batch-size 16 --fold $3 --workers 12 --lr 0.00001 --n-epochs 20 --type binary --jaccard-weight 1 --model UNet16

Log:
num train = 0, num_val = 0
Epoch 1, lr 1e-05: : 0it [00:00, ?it/s]
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Valid loss: nan, jaccard: nan
Epoch 2, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan
Epoch 3, lr 1e-05: : 0it [00:00, ?it/s]
Valid loss: nan, jaccard: nan
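The nan values in this log follow from the empty splits: the progress bar shows 0 iterations per epoch, and NumPy's mean of an empty array is nan, hence the "Mean of empty slice" warning. A minimal sketch of a guard that stops before training when a split is empty is shown below; the helper name require_samples and the file names are hypothetical, not part of the repository.

```python
def require_samples(**splits):
    """Raise early if any named split is empty, instead of training on nothing."""
    empty = [name for name, items in splits.items() if len(items) == 0]
    if empty:
        raise ValueError("no samples found for: " + ", ".join(empty))


# Hypothetical file lists; in the log above both lists would be empty.
train_files = ["frame000.jpg", "frame001.jpg"]
val_files = ["frame002.jpg"]
require_samples(train=train_files, val=val_files)  # passes silently
```

Calling this right after the file lists are built turns twenty silent nan epochs into one immediate error that names the empty split.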

SMKamrulHasan · 2018-10-25T02:03:51Z

> First of all, num train = 0, num_val = 0 looks strange. Are you sure that your DataLoader, defined in https://github.com/ternaus/robot-surgery-segmentation/blob/master/dataset.py, is correct?

My folder layout is:
surgery/data/models/
surgery/data/train/instrument_dataset_1
surgery/data/test/instrument_dataset_1
surgery/data/cropped_train/instrument_dataset_1
surgery/data/train.py
surgery/data/model.py
surgery/data/prepare_data.py
surgery/data/prepare_train_val.py
surgery/data/dataset.py
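A quick way to check whether this layout explains num train = 0 is to count the files the scripts would see. The sketch below assumes, without confirmation from this thread, that the file lists are built from data/cropped_train/instrument_dataset_N/images relative to the working directory; the images subfolder name and the helper count_images are assumptions. If that holds, then with train.py placed inside surgery/data/ the relative path would resolve to surgery/data/data/cropped_train, which does not appear in the listing above.

```python
from pathlib import Path


def count_images(root):
    """Map each instrument_dataset_* folder under root to its number of image files."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # nothing to count; the caller can report the bad path
    for dataset_dir in sorted(root.glob("instrument_dataset_*")):
        images_dir = dataset_dir / "images"
        if images_dir.is_dir():
            counts[dataset_dir.name] = sum(1 for p in images_dir.glob("*") if p.is_file())
        else:
            counts[dataset_dir.name] = 0  # folder exists but has no images/ subfolder
    return counts


if __name__ == "__main__":
    # Run from the same directory as train.py so the relative path resolves the same way.
    root = Path("data") / "cropped_train"
    print("resolved path:", root.resolve())
    print("exists:", root.is_dir())
    print("images per dataset:", count_images(root))
```

An empty result or "exists: False" would point at the working directory or folder names rather than at the model code.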

kimdinhthaibk · 2019-05-27T13:03:11Z

Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

zapaishchykova · 2019-07-12T09:32:03Z

So for anyone encountering this error - check if you changed the problem type:
model = get_model(model_path, model_type='UNet11', problem_type='instruments')
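A mismatch in problem type usually shows up as a different number of output channels in the last layer. The sketch below compares that number with what the caller expects. The key name "final.weight" and the helper check_num_classes are assumptions for illustration, and FakeTensor is only a stand-in so the example runs without PyTorch.

```python
class FakeTensor:
    """Stand-in for a torch.Tensor; only the .shape attribute is used."""

    def __init__(self, *shape):
        self.shape = tuple(shape)


def check_num_classes(state_dict, expected, key="final.weight"):
    """Compare the output channels of the last layer with the expected class count."""
    found = state_dict[key].shape[0]  # conv weights are (out_channels, in_channels, kH, kW)
    if found != expected:
        raise ValueError(
            "checkpoint has %d output channels, model expects %d; "
            "the problem type probably differs" % (found, expected)
        )
    return found


# Toy checkpoint whose last conv layer outputs a single channel.
checkpoint = {"final.weight": FakeTensor(1, 32, 1, 1)}
check_num_classes(checkpoint, expected=1)  # passes; expected=8 would raise ValueError
```

With a real checkpoint, the dict passed in would be the loaded state['model'], and expected would be the class count for the chosen problem type.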

Di1113 · 2019-07-30T08:27:20Z

> Can you give me the DATASET from the surgery/data/train/instrument_dataset_1 and surgery/data/test/instrument_dataset_1?

You might find this link useful: #3 (comment)

SMKamrulHasan changed the title ~~Program runs failed, I am using one GPU to run the program~~ Program failed to train , I am using one GPU to run the program Oct 25, 2018
