CUDA error out of memory #29
Comments
I have a similar but different error now.
Just wanted to add that I am also currently hitting this issue. I am trying to run PDB:1aqf (a tetramer complex) on the Ohio Supercomputer Center's Ascend cluster, which uses NVIDIA A100 80 GB GPUs. Here is the error I am getting:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.88 GiB (GPU 0; 79.15 GiB total capacity; 46.56 GiB already allocated; 32.00 GiB free; 46.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Has anyone figured out a solution for this? I tried setting max_split_size_mb but had no luck.
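For anyone trying the same workaround: max_split_size_mb is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable and must be in place before the first CUDA allocation. A minimal sketch, assuming it is set at the very top of the prediction script (the 128 MB value is only an illustration, not a recommendation from this thread):

```python
import os

# Configure the caching allocator before torch touches the GPU; the split
# size below is an example value, not one suggested elsewhere in this thread.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the allocator picks it up

print(torch.cuda.is_available())  # proceed with the usual inference from here
```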
I ran the following code and updated torch. After that it ran normally.
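(The snippet referred to above was not captured in this thread. As a hedged aside, one quick way to confirm which torch and CUDA build is actually active after an upgrade:)

```python
import torch

# Report the installed torch version, the CUDA toolkit it was built against,
# and whether a GPU is visible to this process.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```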
Hello,
I've been folding single amino-acid substitution protein variants that are part of a heterotetramer complex. See the attached FASTA for an example fold query (SGCB_A9G.fa.txt).
This is the command to request the fold: /path/to/RosettaFold2/run_RF2.sh $fasta_file_location -o $output_directory --pair
3/4 of the models will generate, but the last 1/4 will error out with a memory issue. See the error logs for the traceback details.
RF2_Job336369_99.out.txt
RF2_Job336369_99.err.txt
The computing environment is IBM's LSF. The requested nodes have 64 GB of RAM and a single TeslaV100_SXM2_32GB GPU. The RAM usage doesn't seem to go beyond 22 GB. I don't have insight into the GPU utilization.
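As an aside, GPU-side usage can be logged from inside the job even when the scheduler doesn't report it. A minimal sketch (the log_gpu_memory helper is hypothetical, not part of RF2):

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print a coarse snapshot of GPU memory use on device 0 (hypothetical helper)."""
    if not torch.cuda.is_available():
        return
    gib = 2 ** 30
    allocated = torch.cuda.memory_allocated() / gib   # bytes currently held by tensors
    reserved = torch.cuda.memory_reserved() / gib     # bytes held by the caching allocator
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB total={total:.2f} GiB")

# e.g. call log_gpu_memory("before inference") and log_gpu_memory("after inference")
```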
Is there anything I can do on my end? Can the code be fixed to deal with this issue? My temporary solution is to re-run failed jobs, but this is not ideal.
Best,
Lloyd Tripp