CUDA out of memory: Training from scratch #1329
Replies: 2 comments 9 replies
-
Don't use CacheDataset; use PersistentDataset or the normal Dataset. I hope you have not increased the batch size.
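A minimal sketch of the suggested swap, assuming a typical MONAI training setup (the file list, transform chain, and batch size below are placeholders, not the actual MONAI Label app code). CacheDataset keeps every preprocessed item cached, so the memory it holds grows as more labels are submitted; Dataset runs the transforms on the fly, and PersistentDataset caches preprocessed items to disk instead.

```python
from monai.data import DataLoader, Dataset, PersistentDataset

train_files = [...]        # list of {"image": ..., "label": ...} dicts (placeholder)
train_transforms = ...     # the app's existing preprocessing Compose (placeholder)

# Instead of:
#   train_ds = CacheDataset(data=train_files, transform=train_transforms)

# Plain Dataset: transforms run on the fly, smallest memory footprint
train_ds = Dataset(data=train_files, transform=train_transforms)

# Or PersistentDataset: preprocessed items are cached on disk, not in memory
# train_ds = PersistentDataset(data=train_files, transform=train_transforms,
#                              cache_dir="./persistent_cache")

train_loader = DataLoader(train_ds, batch_size=1, shuffle=True, num_workers=2)
```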
-
You can use the original Dataset instead of CacheDataset. Another possible reason could be that some new images are larger; validation/inference may allocate a large matrix that causes the OOM. You could also try stopping and restarting the MONAI Label server: if there is a cached inference model, running training will not release that part of memory, so a major part of GPU memory stays reserved. Restarting releases the inference memory.
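Since the traceback below shows the OOM happening inside AsDiscreted / one_hot during validation post-processing, one possible workaround is to move the prediction and label tensors to the CPU before the one-hot conversion, so the large temporary tensor lands in host RAM rather than on the 12 GB GPU. This is only a sketch, not the actual MONAI Label app configuration; the dictionary keys and class count are assumptions.

```python
from monai.transforms import AsDiscreted, Compose, ToDeviced

NUM_CLASSES = 7  # assumed: background + 6 structures; adjust to the real label count

val_postprocessing = Compose([
    # Move tensors to CPU so one_hot allocates host memory, not GPU memory
    ToDeviced(keys=["pred", "label"], device="cpu"),
    AsDiscreted(keys=["pred", "label"],
                argmax=(True, False),
                to_onehot=(NUM_CLASSES, NUM_CLASSES)),
])
```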
-
Hi, I am trying to train a new model based on the segmentation model. In my labels/final folder there are 14 files. The training uses 3 validation images and 11 training images. My process so far has been to submit a label after finishing a manual segmentation and then run a training session of 50 epochs. Everything was fine (10 successful sessions) until I went over 12 labels; now I am getting a CUDA out of memory error. I have an RTX 3060 with 12 GB of GPU memory.
My question is: is this just a memory issue, or am I doing something wrong?
My goal is to submit at least 20 more labels. If it is a memory issue, how large a GPU will I eventually need?
How are others dealing with this limitation? Is everyone running 24 or 48 GB GPUs?
The following is the error message I am getting:
[2023-02-26 19:57:13,455] [3824] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:257) - Got new best metric of train_mean_dice: 0.5350669026374817
[2023-02-26 19:57:13,456] [3824] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:201) - Epoch[1] Metrics -- train_artery_mean_dice: 0.5129 train_cyst_mean_dice: 0.5272 train_left kidney_mean_dice: 0.7152 train_mass_mean_dice: 0.5969 train_mean_dice: 0.5351 train_right kidney_mean_dice: 0.5822 train_vein_mean_dice: 0.5004
[2023-02-26 19:57:13,456] [3824] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:212) - Key metric: train_mean_dice best value: 0.5350669026374817 at epoch: 1
[2023-02-26 19:57:13,456] [3824] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:876) - Engine run resuming from iteration 0, epoch 0 until 1 epochs
[2023-02-26 19:57:16,780] [3824] [MainThread] [ERROR] (ignite.engine.engine.SupervisedEvaluator:1086) - Current run is terminating due to exception: applying transform <monai.transforms.compose.Compose object at 0x7fe3c7be98e0>
[2023-02-26 19:57:16,780] [3824] [MainThread] [ERROR] (ignite.engine.engine.SupervisedEvaluator:180) - Exception: applying transform <monai.transforms.compose.Compose object at 0x7fe3c7be98e0>
Traceback (most recent call last):
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 102, in apply_transform
return _apply_transform(transform, data, unpack_items)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 66, in _apply_transform
return transform(parameters)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/post/dictionary.py", line 202, in call
d[key] = self.converter(d[key], argmax, to_onehot, threshold, rounding)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/post/array.py", line 220, in call
img_t = one_hot(
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/networks/utils.py", line 158, in one_hot
o = torch.zeros(size=sh, dtype=dtype, device=labels.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.83 GiB (GPU 0; 11.76 GiB total capacity; 6.59 GiB already allocated; 1.79 GiB free; 8.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 102, in apply_transform
return _apply_transform(transform, data, unpack_items)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 66, in apply_transform
return transform(parameters)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/compose.py", line 174, in call
input = apply_transform(transform, input, self.map_items, self.unpack_items, self.log_stats)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 129, in apply_transform
raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.transforms.post.dictionary.AsDiscreted object at 0x7fe3c7be9250>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/ignite/engine/engine.py", line 1068, in _run_once_on_dataset_as_gen
self.state.output = self._process_function(self, self.state.batch)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/engines/evaluator.py", line 308, in _iteration
engine.fire_event(IterationEvents.MODEL_COMPLETED)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/ignite/engine/engine.py", line 449, in fire_event
return self._fire_event(event_name)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/engines/workflow.py", line 224, in _run_postprocessing
engine.state.batch[i], engine.state.output[i] = engine_apply_transform(b, o, posttrans)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/engines/utils.py", line 258, in engine_apply_transform
transformed_data = apply_transform(transform, data)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 129, in apply_transform
raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.transforms.compose.Compose object at 0x7fe3c7be98e0>
[2023-02-26 19:57:16,782] [3824] [MainThread] [ERROR] (ignite.engine.engine.SupervisedEvaluator:992) - Engine run is terminating due to exception: applying transform <monai.transforms.compose.Compose object at 0x7fe3c7be98e0>
[2023-02-26 19:57:16,782] [3824] [MainThread] [ERROR] (ignite.engine.engine.SupervisedEvaluator:180) - Exception: applying transform <monai.transforms.compose.Compose object at 0x7fe3c7be98e0>
Traceback (most recent call last):
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 102, in apply_transform
return _apply_transform(transform, data, unpack_items)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/transform.py", line 66, in _apply_transform
return transform(parameters)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/post/dictionary.py", line 202, in call
d[key] = self.converter(d[key], argmax, to_onehot, threshold, rounding)
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/transforms/post/array.py", line 220, in call
img_t = one_hot(
File "/home/sam/anaconda3/envs/monailabel-env/lib/python3.9/site-packages/monai/networks/utils.py", line 158, in one_hot
o = torch.zeros(size=sh, dtype=dtype, device=labels.device)
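The allocator message above also points at fragmentation (reserved memory well above allocated memory). A minimal, hedged sketch of that suggestion, assuming it is set before PyTorch initialises CUDA (or exported in the shell that launches the MONAI Label server); it limits the size of cached allocator blocks but does not add capacity to the 12 GB card:

```python
import os

# Must be set before torch allocates any CUDA memory for the setting to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  -- imported after the env var on purpose
```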