CUDA out of memory #94

Open · reorocha opened this issue Dec 14, 2023 · 0 comments
Dear all,

I am trying to run model-angelo on a large system, using a FASTA file that contains multiple sequences. After the initial C-alpha prediction, the program crashes with a "CUDA out of memory" error. Notably, when I run it in no_seq mode, it completes without problems. The full error output is below. I am running model-angelo on Linux Ubuntu 18.04 with two NVIDIA RTX 2080 Ti graphics cards.
Thank you for your assistance!
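Since each 2080 Ti reports only 10.76 GiB total in the errors below, it may be worth confirming how much memory is actually free on each card before the run. A quick check with standard torch APIs (a generic diagnostic, not part of the original run):

```python
import torch

# Print free vs. total memory for every visible GPU.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```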

```
model_angelo build -v map.mrc -pf protein.fasta -rf rna.fasta -o cryosparc_P45_J63_006_volume_map_sharp-Zflipped --device 0,1
```

```
---------------------------- ModelAngelo -----------------------------
By Kiarash Jamali, Scheres Group, MRC Laboratory of Molecular Biology
--------------------- Initial C-alpha prediction ---------------------
100%|██████████████████████████████████████████████| 5832/5832 [59:11<00:00, 1.64it/s]
------------------ GNN model refinement, round 1 / 3 ------------------
0%| | 0/16756 [00:00<?, ?it/s]2023-12-14 at 10:11:22 | ERROR | Error in ModelAngelo
Traceback (most recent call last):
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/inference.py", line 150, in infer
collated_results, protein = collate_nn_results(
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/utils/gnn_inference_utils.py", line 135, in collate_nn_results
collated_results["pred_positions"][indices[update_slice]] += results[
TypeError: 'NoneType' object is not subscriptable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/apps/build.py", line 250, in main
gnn_output = gnn_infer(gnn_infer_args)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/inference.py", line 130, in infer
with MultiGPUWrapper(model_definition_path, state_dict_path, device_names, args.fp16) as wrapper:
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 186, in exit
self.del()
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 180, in del
self.proc_ctx.join()
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 102, in run_inference
raise e
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 97, in run_inference
output = model(**inference_data.data)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/.cache/torch/hub/checkpoints/model_angelo_v1.0/nucleotides/gnn/model.py", line 17, in forward
return self.ipa(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/multi_layer_ipa.py", line 174, in forward
) = self.seq_attentions[idx](
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/sequence_attention.py", line 150, in forward_checkpoint
return torch.utils.checkpoint.checkpoint(
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
ret = function(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/sequence_attention.py", line 201, in _intern_forward
sequence_attention_weights = padded_sequence_softmax(
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/utils/torch_utils.py", line 337, in padded_sequence_softmax
padded_softmax = padded_softmax * padded_mask # Mask out padded values
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 614.00 MiB. GPU 0 has a total capacty of 10.76 GiB of which 326.81 MiB is free. Process 21144 has 1.64 GiB memory in use. Including non-PyTorch memory, this process has 8.55 GiB memory in use. Process 21608 has 245.00 MiB memory in use. Of the allocated memory 7.24 GiB is allocated by PyTorch, and 254.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

0%| | 0/16756 [00:16<?, ?it/s]
Exception ignored in: <function MultiGPUWrapper.__del__ at 0x7fb712e12a70>
Traceback (most recent call last):
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 180, in del
self.proc_ctx.join()
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 102, in run_inference
raise e
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 97, in run_inference
output = model(**inference_data.data)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/.cache/torch/hub/checkpoints/model_angelo_v1.0/nucleotides/gnn/model.py", line 17, in forward
return self.ipa(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/multi_layer_ipa.py", line 174, in forward
) = self.seq_attentions[idx](
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/sequence_attention.py", line 150, in forward_checkpoint
return torch.utils.checkpoint.checkpoint(
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 458, in checkpoint
ret = function(*args, **kwargs)
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/gnn/sequence_attention.py", line 201, in _intern_forward
sequence_attention_weights = padded_sequence_softmax(
File "/home/sissler/miniconda3/envs/model_angelo/lib/python3.10/site-packages/model_angelo/utils/torch_utils.py", line 336, in padded_sequence_softmax
padded_softmax = torch.softmax(padded_sequence_values, dim=dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 614.00 MiB. GPU 1 has a total capacty of 10.76 GiB of which 534.56 MiB is free. Process 21144 has 1.64 GiB memory in use. Including non-PyTorch memory, this process has 7.95 GiB memory in use. Of the allocated memory 6.64 GiB is allocated by PyTorch, and 254.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```
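Both workers die at the same helper in model_angelo/utils/torch_utils.py: a softmax over the full padded sequence-attention tensor, followed by a masking multiply. A minimal sketch of what that helper appears to do, reconstructed from the two quoted lines (the signature, dim default, and return value are assumptions):

```python
import torch

def padded_sequence_softmax(padded_sequence_values, padded_mask, dim=-1):
    # Softmax over the entire padded tensor: this materialises a full-size
    # intermediate, matching the ~614 MiB allocation both GPUs failed on.
    padded_softmax = torch.softmax(padded_sequence_values, dim=dim)
    # Zero out the padded positions; the elementwise multiply allocates
    # another tensor of the same size.
    padded_softmax = padded_softmax * padded_mask  # Mask out padded values
    return padded_softmax
```

Because this tensor scales with the total length of the sequences given in the FASTA files, a multi-sequence run can exceed the 11 GiB cards even when the no_seq path fits.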

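The allocator message printed on both GPUs suggests one mitigation: capping the split size to reduce fragmentation. A sketch of how that might be applied (the value 128 is an arbitrary starting point to tune, not a documented ModelAngelo setting):

```python
import os

# Equivalent to running `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`
# in the shell before invoking model_angelo; it must be set before the first
# CUDA allocation to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```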