spIsoNet can't run as relion_external_reconstruct #3

donghuachensu opened this issue Apr 16, 2024 · 16 comments
@donghuachensu

Hi Yun-Tao,

I got the following errors when using spIsoNet for external reconstruction in RELION 4 or RELION 5-beta. Any suggestions? Thanks!

The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-15 19:26:44, INFO voxel_size 1.399999976158142
04-15 19:26:45, INFO voxel_size 1.399999976158142
04-15 19:39:35, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-15 19:39:35, INFO calculating fast 3DFSC, this will take few minutes

04-15 19:43:02, INFO voxel_size 1.399999976158142
04-15 19:51:23, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-15 19:51:27, INFO voxel_size 1.399999976158142
04-15 19:51:32, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-15 19:54:34, INFO Start preparing subvolumes!
04-15 19:54:59, INFO Done preparing subvolumes!
04-15 19:54:59, INFO Start training!
04-15 19:55:02, INFO Port number: 45495
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in
sys.exit(main())
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 50, in ddp_train
model = model.cuda()
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in
with mrcfile.open(mrc1_cor) as d1:
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in init
self._open_file(name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /home/groups/kornberg/donghuac/relion/src/backprojector.cpp, line 1323
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4e6a59]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x46434e]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1b02) [0x5222b2]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x3e9) [0x523279]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(main+0x55) [0x4d47b5]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff6f5fbb555]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x4d805e]

ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

@procyontao
Collaborator

Hi,

I have tested RELION 4 but not RELION 5, but we think the RELION 5 Blush regularization shares similarities with spIsoNet denoising.

This problem is probably related to a failure to open a port, which is 45495 in your case. spIsoNet automatically detects a port that is not being used for communication. If the anisotropy correction for half maps can be executed correctly, the RELION-embedded spIsoNet should also work.

What I have in mind is to check what differs between the environment when you run "spisonet.py reconstruct" directly and the environment the RELION wrapper runs in, such as whether the correct conda environment is used or whether there are firewall problems.
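
For anyone hitting the same [c10d] warnings: errno 97 against [::] usually just means IPv6 is unavailable on the node, and the IPv4 socket can normally still bind, so those warnings alone are not necessarily fatal. As a reference for the port question, here is a minimal sketch of how an unused port can be probed from Python; this only illustrates the general approach and is not necessarily spIsoNet's exact logic.

import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 lets the kernel choose a free port
        return s.getsockname()[1]  # the port that was actually assigned

if __name__ == "__main__":
    print("Free port that the DDP setup could use:", find_free_port())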

@donghuachensu
Author

donghuachensu commented Apr 16, 2024

Hi,

Thanks for the reply. I also tested spIsoNet in RELION 4 on my workstation (the previous run was on a cluster). Here is the error. Please take a look.

The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 15:04:45, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-16 15:04:45, INFO calculating fast 3DFSC, this will take few minutes

04-16 15:06:31, INFO voxel_size 1.399999976158142
04-16 15:09:55, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-16 15:09:57, INFO voxel_size 1.399999976158142
04-16 15:10:00, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-16 15:11:06, INFO Start preparing subvolumes!
04-16 15:11:24, INFO Done preparing subvolumes!
04-16 15:11:24, INFO Start training!
04-16 15:11:24, INFO Port number: 44689
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efd44c15d87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7efd45daed26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7efd45db227d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7efd45db2e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7efda2973bf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7efdb898d802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7efdb892d450 in /lib64/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55840bcd87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f5585255d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f558525927d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f5585259e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f55e1e1abf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7f55f7e34802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7f55f7dd4450 in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aca5aed87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f8acb747d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f8acb74b27d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8acb74be79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f8b2830cbf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7f8b3e326802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7f8b3e2c6450 in /lib64/libc.so.6)

Traceback (most recent call last):
File "/data/donghua/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in
sys.exit(main())
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGABRT
Traceback (most recent call last):
File "/data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in
with mrcfile.open(mrc1_cor) as d1:
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in init
self._open_file(name)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /data/donghua/relion/src/backprojector.cpp, line 1294
ERROR:
ERROR: there was something wrong with system call: python /data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/data/donghua/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4be749]
/data/donghua/relion/bin/relion_refine_mpi() [0x44d378]
/data/donghua/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x17a4) [0x4f4b14]
/data/donghua/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x4c1) [0x4f5b01]
/data/donghua/relion/bin/relion_refine_mpi(main+0x58) [0x4ad658]
/lib64/libc.so.6(+0x3feb0) [0x7f8469dbfeb0]
/lib64/libc.so.6(__libc_start_main+0x80) [0x7f8469dbff60]
/data/donghua/relion/bin/relion_refine_mpi(_start+0x25) [0x4b0815]

ERROR:
ERROR: there was something wrong with system call: python /data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

@donghuachensu
Author

Hi, are the above two errors the same? I got one from a cluster and another one from the workstation. Any suggestions? Thanks!

@procyontao
Collaborator

Hi,

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1..." problem means the RELION reconstruction was not performed properly. This will happen because spIsoNet does not work with RELION 5.

I still do not understand how the NCCL-related error happens.

@procyontao
Collaborator

procyontao commented Apr 18, 2024

Again I want to confirm whether the Anisotropy Correction ("spisonet.py reconstruct") gives you these errors within the same environment.

@procyontao procyontao self-assigned this Apr 18, 2024
@procyontao
Collaborator

Please also check whether issue #2 is related.

@donghuachensu
Author

I can confirm that the anisotropy correction worked without any error on my 2-GPU workstation, which has the same type of GPU as the 4-GPU workstation where I got the second error above.

@olibclarke

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1..." problem means the RELION reconstruction was not performed properly. This will happen because spIsoNet does not work with RELION 5.

Is RELION 5 compatibility on the roadmap? Or, for now, would you recommend setting up a separate installation of RELION 4 for misalignment correction?

@DanGonite57

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1..." problem means the RELION reconstruction was not performed properly. This will happen because spIsoNet does not work with RELION 5.

Hi, are you sure that it does not work with RELION-5? Is this a matter of spIsoNet not running at all in RELION-5, or of not giving the intended output? I had been running it through RELION-5 for initial testing before seeing this comment and it doesn't appear to have any issues, but I can't speak to whether it is producing "correct" results.

@olibclarke

olibclarke commented Apr 18, 2024

In our hands with RELION 5 it seems to run, but the unfil.mrc and corrected.mrc maps are blank, leading to a crash after one iteration; this does not happen without --external_reconstruct. I haven't tried RELION 4 yet.

EDIT:
What does seem to work in RELION 5 is the following: run a few iterations without --external_reconstruct, kill the refinement, then continue the refinement from the last _optimiser.star with the --external_reconstruct flag added. I just tried this and it seems to work, generating normal-looking external reconstruction volumes (I can't verify yet whether it is helping!). It also only seems to work if run in the spisonet conda env.

EDIT2:

Scratch that, I don't think it is actually doing anything. Here is the log:

 + Making system call for external reconstruction: python /home/user/software/spIsoNet/build/lib/spIsoNet/bin/relion_wrapper.py Refine3D/job012/run_it008_half1_class001_external_reconstruct.star
iter = 008
set CUDA_VISIBLE_DEVICES=None
set CONDA_ENV=spisonet
set ISONET_WHITENING=True
set ISONET_WHITENING_LOW=10
set ISONET_RETRAIN_EACH_ITER=True
set ISONET_BETA=0.5
set ISONET_ALPHA=1
set ISONET_START_HEALPIX=3
set ISONET_ACC_BATCHES=2
set ISONET_EPOCHS=5
set ISONET_KEEP_LOWRES=False
set ISONET_LOWPASS=True
set ISONET_ANGULAR_WHITEN=False
set ISONET_3DFSD=False
set ISONET_FSC_05=False
set ISONET_FSC_WEIGHTING=True
set ISONET_START_RESOLUTION=15.0
set ISONET_KEEP_LOWRES= False
healpix = 2
symmetry = C1
mask_file = mask.mrc
pixel size = 1.125
resolution at 0.5 and 0.143 are 7.384615 and 5.538462
real limit resolution to 5.538462
 + External reconstruction finished successfully, reading result back in ...

It seemingly runs and reconstructs, but never trains a model...

EDIT 3:

Nope, it is working; it just hadn't reached fine enough angular sampling. Working now. One thing I notice, though: it defaults to using all GPUs. It would be better if it could somehow default to using the GPUs that have been assigned to this job in RELION (not sure if that is possible?).
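
On the GPU point, one generic way to restrict PyTorch to a subset of GPUs is to set CUDA_VISIBLE_DEVICES before any CUDA context is created. The sketch below is only an illustration (the "0,1" ID string is a hypothetical example of what RELION might assign to a job), not the actual wrapper code.

import os
import torch

def restrict_gpus(gpu_ids: str) -> int:
    """Expose only the listed GPUs to PyTorch; must run before CUDA is initialized."""
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids  # e.g. "0,1"
    return torch.cuda.device_count()              # number of GPUs DDP would then see

if __name__ == "__main__":
    print("Visible GPUs:", restrict_gpus("0,1"))  # hypothetical ID list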

@donghuachensu
Author

donghuachensu commented May 2, 2024

Hi All,

I used the recommended Option 1 (below) from the tutorial to install spIsoNet. Why can I not see a bin directory (which should contain the program spisonet.py) created directly under the spIsoNet directory after the installation?

git clone https://github.com/IsoNet-cryoET/spIsoNet.git
conda env create -f setup.yml
conda activate spisonet

Or should I just copy all of the files from ~/spIsoNet/spIsoNet/bin/*.py to ~/spIsoNet/bin?
Any suggestions? Thanks!

@procyontao
Collaborator

Hi,

The correct path is actually ~/spIsoNet/spIsoNet/bin/spisonet.py. All of the code resides within the ~/spIsoNet/spIsoNet directory, so there is no need to move any files.

@donghuachensu
Author

Thank you for your clarification!

@donghuachensu
Author

donghuachensu commented May 24, 2024

I found that the first error above (RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable) was related to the GPU node's default Compute Mode setting (fixed on our cluster with #SBATCH --gpu_cmode=shared), and the second error above (Some NCCL operations have failed or timed out) was fixed by setting export NCCL_P2P_DISABLE=1.
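
For others who hit the same NCCL ALLGATHER timeout, the same workaround can also be applied programmatically; the environment variable must be set before the distributed process group is created. NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables, but whether disabling peer-to-peer transfers is needed depends on the machine.

import os

# Workaround reported in this thread: disable NCCL peer-to-peer transfers,
# which can hang on some multi-GPU workstations and cause collective timeouts.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Optional: extra NCCL diagnostics when debugging hangs.
os.environ.setdefault("NCCL_DEBUG", "WARN")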

@procyontao
Collaborator

Thank you for troubleshooting and reporting back.

@donghuachensu
Author

In the file spIsoNet_v1.0_Tutorial.pdf, should one more step (pip install .) be added as the last step of Option 3 of the installation? Please confirm.
