spIsoNet can't run as relion_external_reconstruct #3

donghuachensu · 2024-04-16T04:48:54Z

Hi Yun-Tao,

I got the following errors when I was using spIsoNet as external_reconstruct in Relion4 or Relion5-beta. Any suggestions? Thanks!

The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-15 19:26:44, INFO voxel_size 1.399999976158142
04-15 19:26:45, INFO voxel_size 1.399999976158142
04-15 19:39:35, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-15 19:39:35, INFO calculating fast 3DFSC, this will take few minutes

04-15 19:43:02, INFO voxel_size 1.399999976158142
04-15 19:51:23, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-15 19:51:27, INFO voxel_size 1.399999976158142
04-15 19:51:32, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-15 19:54:34, INFO Start preparing subvolumes!
04-15 19:54:59, INFO Done preparing subvolumes!
04-15 19:54:59, INFO Start training!
04-15 19:55:02, INFO Port number: 45495
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
sys.exit(main())
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 50, in ddp_train
model = model.cuda()
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in <module>
with mrcfile.open(mrc1_cor) as d1:
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in __init__
self._open_file(name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /home/groups/kornberg/donghuac/relion/src/backprojector.cpp, line 1323
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4e6a59]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x46434e]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1b02) [0x5222b2]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x3e9) [0x523279]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(main+0x55) [0x4d47b5]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff6f5fbb555]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x4d805e]

ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

procyontao · 2024-04-16T19:05:07Z

Hi,

I tested relion4 but not relion5, but we think the relion5 blush regularization shares similarities with spIsoNet denoising.

This problem is probably related to a failure to open a port, which is 45495 in your case. spIsoNet automatically detects a port that is not in use and uses it for communication. If Anisotropy Correction for half maps runs correctly, the spIsoNet embedded in RELION should also work.

What I have in mind is checking what differs between the environment in which you run "spisonet.py reconstruct" and the one used by the RELION wrapper, such as whether the correct conda environment is used or whether there are firewall problems.
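
One way to compare the two environments is to run the same small diagnostic script once from the interactive shell and once from inside the RELION job. The following is a minimal sketch using only the Python standard library; the function names `can_bind` and `environment_report` are made up for this illustration and are not part of spIsoNet or RELION. It reports which interpreter and conda environment are active, what CUDA_VISIBLE_DEVICES is set to, and whether a TCP socket can be bound on the IPv4 and IPv6 loopback addresses (the errno 97 warnings above mention the address family).

```python
import errno
import os
import socket
import sys


def can_bind(family, host, port=0):
    """Try to bind a TCP socket of the given family; return (ok, detail)."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            # Port 0 asks the OS for any free port.
            s.bind((host, port))
            return True, f"bound to port {s.getsockname()[1]}"
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, exc.errno)
        return False, f"{name}: {exc.strerror}"


def environment_report():
    """Collect settings likely to differ between a shell and a RELION job."""
    return {
        "python": sys.executable,
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "<unset>"),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        "ipv4_bind": can_bind(socket.AF_INET, "127.0.0.1"),
        "ipv6_bind": can_bind(socket.AF_INET6, "::1"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the two reports differ (for example a different `python` path, or `conda_env` unset in the RELION job), that difference is a reasonable first place to look.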

donghuachensu · 2024-04-16T21:24:55Z

Hi,

Thanks for the reply. I also tested spIsoNet in Relion4 on my workstation (the previous run was on a cluster). Here is the error. Please take a look.

The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 15:04:45, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-16 15:04:45, INFO calculating fast 3DFSC, this will take few minutes

04-16 15:06:31, INFO voxel_size 1.399999976158142
04-16 15:09:55, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-16 15:09:57, INFO voxel_size 1.399999976158142
04-16 15:10:00, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-16 15:11:06, INFO Start preparing subvolumes!
04-16 15:11:24, INFO Done preparing subvolumes!
04-16 15:11:24, INFO Start training!
04-16 15:11:24, INFO Port number: 44689
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efd44c15d87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7efd45daed26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7efd45db227d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7efd45db2e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7efda2973bf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7efdb898d802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7efdb892d450 in /lib64/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55840bcd87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f5585255d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f558525927d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f5585259e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f55e1e1abf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7f55f7e34802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7f55f7dd4450 in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aca5aed87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f8acb747d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f8acb74b27d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8acb74be79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f8b2830cbf4 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x9f802 (0x7f8b3e326802 in /lib64/libc.so.6)
frame #6: + 0x3f450 (0x7f8b3e2c6450 in /lib64/libc.so.6)

Traceback (most recent call last):
File "/data/donghua/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
sys.exit(main())
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGABRT
Traceback (most recent call last):
File "/data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in <module>
with mrcfile.open(mrc1_cor) as d1:
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in __init__
self._open_file(name)
File "/data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /data/donghua/relion/src/backprojector.cpp, line 1294
ERROR:
ERROR: there was something wrong with system call: python /data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/data/donghua/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4be749]
/data/donghua/relion/bin/relion_refine_mpi() [0x44d378]
/data/donghua/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x17a4) [0x4f4b14]
/data/donghua/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x4c1) [0x4f5b01]
/data/donghua/relion/bin/relion_refine_mpi(main+0x58) [0x4ad658]
/lib64/libc.so.6(+0x3feb0) [0x7f8469dbfeb0]
/lib64/libc.so.6(__libc_start_main+0x80) [0x7f8469dbff60]
/data/donghua/relion/bin/relion_refine_mpi(_start+0x25) [0x4b0815]

ERROR:
ERROR: there was something wrong with system call: python /data/donghua/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

donghuachensu · 2024-04-17T22:25:05Z

Hi, are the above two errors the same? I got one from a cluster and another one from the workstation. Any suggestions? Thanks!

procyontao · 2024-04-18T00:47:32Z

Hi,

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not properly performed. This will happen because spIsoNet does not work with RELION5.

I still do not understand how the NCCL-related error happens.

procyontao · 2024-04-18T00:51:09Z

Again I want to confirm whether the Anisotropy Correction ("spisonet.py reconstruct") gives you these errors within the same environment.

procyontao · 2024-04-18T01:21:41Z

Please also check whether issue #2 is related.

donghuachensu · 2024-04-18T01:39:53Z

I can confirm that the Anisotropy Correction worked without any error on my 2-GPU workstation which has the same type of GPU as my 4-GPU workstation where I got the second error above.

olibclarke · 2024-04-18T12:37:44Z

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not properly performed. This will happen because spIsoNet does not work with RELION5.

Is relion 5 compatibility on the roadmap? Or for now, would you recommend to set up a separate installation of relion 4 for misalignment correction?

DanGonite57 · 2024-04-18T15:11:33Z

I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not properly performed. This will happen because spIsoNet does not work with RELION5.

Hi, are you sure that it does not work with RELION-5? Is this a matter of spIsoNet not running at all in RELION-5, or not giving the intended output? Because I have been running it through RELION-5 for initial testing before seeing this comment and it doesn't appear to have any issues, but I can't speak for whether it is producing "correct" results.

olibclarke · 2024-04-18T19:49:57Z

In our hands with relion 5 it seems to run, but the unfil.mrc and corrected.mrc maps are blank, leading to a crash after one iteration - this does not happen without --external_reconstruct. I haven't tried relion 4 yet

EDIT:
What does seem to work in relion 5 is the following: Run a few iterations without --external_reconstruct. Kill the refinement, then continue the refinement from the last _optimiser.star, adding in the --external_reconstruct flag. Just tried this and it seems to work, and generates normal-looking external reconstruction volumes (can't verify yet whether it is helping!). Also it only seems to work if run in the spisonet conda env.

EDIT2:

Scratch that, I don't think it is actually doing anything. Here is the log:

 + Making system call for external reconstruction: python /home/user/software/spIsoNet/build/lib/spIsoNet/bin/relion_wrapper.py Refine3D/job012/run_it008_half1_class001_external_reconstruct.star
iter = 008
set CUDA_VISIBLE_DEVICES=None
set CONDA_ENV=spisonet
set ISONET_WHITENING=True
set ISONET_WHITENING_LOW=10
set ISONET_RETRAIN_EACH_ITER=True
set ISONET_BETA=0.5
set ISONET_ALPHA=1
set ISONET_START_HEALPIX=3
set ISONET_ACC_BATCHES=2
set ISONET_EPOCHS=5
set ISONET_KEEP_LOWRES=False
set ISONET_LOWPASS=True
set ISONET_ANGULAR_WHITEN=False
set ISONET_3DFSD=False
set ISONET_FSC_05=False
set ISONET_FSC_WEIGHTING=True
set ISONET_START_RESOLUTION=15.0
set ISONET_KEEP_LOWRES= False
healpix = 2
symmetry = C1
mask_file = mask.mrc
pixel size = 1.125
resolution at 0.5 and 0.143 are 7.384615 and 5.538462
real limit resolution to 5.538462
 + External reconstruction finished successfully, reading result back in ...

It seemingly runs and reconstructs, but never trains a model...

EDIT 3:

Nope, it is working - it just hadn't reached fine enough angular sampling. Working now. One thing I notice though - it defaults to using all GPUs - it would be better if somehow it could default to using the GPUs that have been assigned to this job in Relion (not sure if that is possible?)
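
For readers who want to limit which GPUs a child process can see, the usual mechanism is the CUDA_VISIBLE_DEVICES environment variable, which CUDA programs read at start-up. The sketch below shows the general technique only; `env_with_gpus` and `run_restricted` are hypothetical helper names, and it is an assumption (not confirmed in this thread) that relion_wrapper.py would honor a value inherited this way rather than override it from its own settings, as the "set CUDA_VISIBLE_DEVICES=None" line in the log might suggest.

```python
import os
import subprocess
import sys


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment exposing only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_restricted(command, gpu_ids):
    """Run command in a child process that sees only gpu_ids."""
    return subprocess.run(command, env=env_with_gpus(gpu_ids), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process: prints the variable a CUDA program would read.
    child = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(run_restricted(child, [0, 1]).stdout.strip())
```

The parent's own environment is left untouched, so other jobs launched from the same shell are not affected.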

donghuachensu · 2024-05-02T22:19:18Z

Hi All,

I used the recommended Option 1 (shown below) in the tutorial to install spIsoNet. Why is there no bin directory (which should contain the program spisonet.py) directly under the spIsoNet directory after the installation?

git clone https://github.com/IsoNet-cryoET/spIsoNet.git
conda env create -f setup.yml
conda activate spisonet

Or should I just copy all the files in ~/spIsoNet/spIsoNet/bin/*.py to ~/spIsoNet/bin?
Any suggestions? Thanks!

procyontao · 2024-05-06T18:50:05Z

Hi,

The correct path is actually ~/spIsoNet/spIsoNet/bin/spisonet.py. All the code resides within the ~/spIsoNet/spIsoNet directory, so there's no need to move any files.

donghuachensu · 2024-05-06T19:17:28Z

Thank you for your clarification!

donghuachensu · 2024-05-24T01:30:09Z

I found that the first error above (RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable) was related to the GPU node's Compute Mode setting of Default (e.g. on our cluster: #SBATCH --gpu_cmode=shared), and that the second error above (Some NCCL operations have failed or timed out) was fixed by this setting: export NCCL_P2P_DISABLE=1.
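
Both findings can be checked from Python before starting a job. This is a minimal sketch under stated assumptions: it queries `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader` for each GPU's compute mode and flags any GPU that is not in Default mode, on my reading of the comment above that Default (shared) mode is what the multi-process training needs. It also sets NCCL_P2P_DISABLE=1 in the current process so that child processes inherit it. The helper names are hypothetical and not part of spIsoNet.

```python
import os
import shutil
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' CSV lines into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Ask nvidia-smi for each GPU's compute mode; return {} if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return {}
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,compute_mode",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return {}
    return parse_compute_modes(result.stdout)


def non_default_gpus(modes):
    """Return the sorted indices of GPUs whose compute mode is not Default."""
    return sorted(i for i, mode in modes.items() if mode != "Default")


if __name__ == "__main__":
    # Same effect as `export NCCL_P2P_DISABLE=1` for processes started from here.
    os.environ["NCCL_P2P_DISABLE"] = "1"
    flagged = non_default_gpus(query_compute_modes())
    if flagged:
        print(f"GPUs not in Default compute mode: {flagged}")
    else:
        print("No GPU reported a non-Default compute mode.")
```

Disabling NCCL peer-to-peer transfers can reduce multi-GPU throughput, so it is probably worth keeping this setting only on machines where the timeout actually occurs.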

procyontao · 2024-05-24T03:05:52Z

Thank you for troubleshooting and reporting back.

donghuachensu · 2024-05-24T05:26:12Z

In the file spIsoNet_v1.0_Tutorial.pdf, should one more step (pip install .) be added as the last step of Option 3 for the installation? Please confirm.

procyontao self-assigned this Apr 18, 2024
