
CUDA error out of memory #29

Open
lloydtripp opened this issue Feb 19, 2024 · 3 comments

@lloydtripp

Hello,

I've been folding single amino-acid substitution protein variants that are part of a heterotetramer complex. See the attached FASTA for an example fold query (SGCB_A9G.fa.txt).

This is the command to request the fold: /path/to/RosettaFold2/run_RF2.sh $fasta_file_location -o $output_directory --pair

About 3/4 of the models will generate, but the last 1/4 will error out with a memory issue. See the attached error logs for the traceback details.
RF2_Job336369_99.out.txt
RF2_Job336369_99.err.txt

The computing environment is IBM's LSF. The requested nodes have 64 GB of RAM and a single TeslaV100_SXM2_32GB GPU. RAM usage doesn't seem to exceed 22 GB; I don't have insight into the GPU utilization.

Is there anything I can do on my end? Can the code be fixed to deal with this issue? My temporary solution is to re-run failed jobs, but this is not ideal.

Best,
Lloyd Tripp
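One possible stopgap, assuming the failure is transient and the LSF job can simply be resubmitted, is a small retry wrapper around the run command. The loop below is only a sketch: $fasta_file_location and $output_directory are taken from the command above, and the optional nvidia-smi line is included because GPU utilization is otherwise hard to see under LSF.

# Sketch only: retry run_RF2.sh a few times if it exits non-zero (e.g. on a CUDA OOM).
# Optionally log GPU memory every 5 s in the background for visibility into utilization.
nvidia-smi --query-gpu=memory.used --format=csv -l 5 > gpu_mem.log &
smi_pid=$!
for attempt in 1 2 3; do
    /path/to/RosettaFold2/run_RF2.sh "$fasta_file_location" -o "$output_directory" --pair && break
    echo "Attempt $attempt failed; retrying..." >&2
done
kill "$smi_pid"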

@lloydtripp
Author

I'm now seeing a similar but distinct error.

RF2-preMSA_Job557425_202.err.txt

@robert-bolz

Just wanted to add that I am also currently getting this issue. I am trying to run PDB:1aqf (a tetramer complex).

I am running on the Ohio Supercomputer Center's Ascend cluster, which uses NVIDIA A100 80 GB GPUs. Here is the error I am getting:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.88 GiB (GPU 0; 79.15 GiB total capacity; 46.56 GiB already allocated; 32.00 GiB free; 46.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Has anyone figured out a solution for this? I tried setting max_split_size_mb, but no luck.
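For what it's worth, max_split_size_mb is normally passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before the script is launched. The value and paths below are only examples, and a smaller split size is unlikely to help if a single tensor genuinely needs ~33 GiB.

# Sketch: configure the PyTorch caching allocator before launching RF2.
# max_split_size_mb:128 is an example value; smaller splits reduce fragmentation
# but cannot rescue an allocation larger than the remaining free memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
/path/to/RosettaFold2/run_RF2.sh input.fa -o output_dir --pair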

@Khas-Erdene-1

> Just wanted to add I am also currently getting this issue, I am trying to run PDB:1aqf (tetramer complex) [...] Has anyone figured out a solution for this? I tried setting max_split_size_mb but no luck

I ran the following command to update torch, and after that it ran normally:
pip install torch torchvision torchaudio
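If anyone else tries this, note that a bare pip install may pull a torch build for a different CUDA version than the cluster's drivers provide, so it is worth checking what actually got installed and that the GPU is still visible:

# Print the installed torch version, its CUDA build, and whether a GPU is visible.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"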
