
CUDA error out of memory #29

Open
lloydtripp opened this issue Feb 19, 2024 · 3 comments

@lloydtripp

Hello,

I've been folding single amino-acid substitution protein variants that are part of a heterotetramer complex. See the attached FASTA for an example fold query (SGCB_A9G.fa.txt).

This is the command to request the fold: /path/to/RosettaFold2/run_RF2.sh $fasta_file_location -o $output_directory --pair

About 3/4 of the models will generate, but the last 1/4 will error out with a memory issue. See the attached error logs for the traceback details.
RF2_Job336369_99.out.txt
RF2_Job336369_99.err.txt

The computing environment is IBM's LSF. The requested nodes have 64 GB of RAM and a single TeslaV100_SXM2_32GB GPU. RAM usage doesn't seem to exceed 22 GB; I don't have insight into the GPU utilization.

Is there anything I can do on my end? Can the code be fixed to deal with this issue? My temporary solution is to re-run failed jobs, but this is not ideal.

Best,
Lloyd Tripp
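One possible stopgap, assuming the failure is transient and the LSF job can simply be resubmitted, is a small retry wrapper around the run command. The loop below is only a sketch: $fasta_file_location and $output_directory are taken from the command above, and the optional nvidia-smi line is included because GPU utilization is otherwise hard to see under LSF.

# Sketch only: retry run_RF2.sh a few times if it exits non-zero (e.g. on a CUDA OOM).
# Optionally log GPU memory every 5 s in the background for visibility into utilization.
nvidia-smi --query-gpu=memory.used --format=csv -l 5 > gpu_mem.log &
smi_pid=$!
for attempt in 1 2 3; do
    /path/to/RosettaFold2/run_RF2.sh "$fasta_file_location" -o "$output_directory" --pair && break
    echo "Attempt $attempt failed; retrying..." >&2
done
kill "$smi_pid"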

@lloydtripp
Author

I'm now seeing a similar but distinct error.

RF2-preMSA_Job557425_202.err.txt

@robert-bolz

Just wanted to add that I am also currently getting this issue. I am trying to run PDB:1aqf (a tetramer complex).

I am running on the Ohio Supercomputer Center's Ascend cluster, which uses NVIDIA A100 80 GB GPUs. Here is the error I am getting:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.88 GiB (GPU 0; 79.15 GiB total capacity; 46.56 GiB already allocated; 32.00 GiB free; 46.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Has anyone figured out a solution for this? I tried setting max_split_size_mb, but no luck.
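For what it's worth, max_split_size_mb is normally passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before the script is launched. The value and paths below are only examples, and a smaller split size is unlikely to help if a single tensor genuinely needs ~33 GiB.

# Sketch: configure the PyTorch caching allocator before launching RF2.
# max_split_size_mb:128 is an example value; smaller splits reduce fragmentation
# but cannot rescue an allocation larger than the remaining free memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
/path/to/RosettaFold2/run_RF2.sh input.fa -o output_dir --pair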

@Khas-Erdene-1

> Just wanted to add I am also currently getting this issue, I am trying to run PDB:1aqf (tetramer complex) [...] Has anyone figured out a solution for this? I tried setting max_split_size_mb but no luck

I ran the following command to update torch, and after that it ran normally:
pip install torch torchvision torchaudio
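If anyone else tries this, note that a bare pip install may pull a torch build for a different CUDA version than the cluster's drivers provide, so it is worth checking what actually got installed and that the GPU is still visible:

# Print the installed torch version, its CUDA build, and whether a GPU is visible.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"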
