-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spIsoNet can't run as relion_external_reconstruct #3
Comments
Hi, I tested relion4 but not relion5, but we think the relion5 blush regularization shares similarities with spIsoNet denoising. This problem is probably related to failing to open a port, which is 45495 in your case. spIsoNet will automatically detect a port that is not been used for communication. If Anisotropy correction for half maps can be executed correctly, this RELION embedded spIsoNet should also work. What I have in mind is to check what differs between the environment when you are running "spisonet.py reconstruct" and the relion wrapper. Such as whether the correct conda is used, or whether there are firewall problems. |
Hi, Thanks for the reply. I also tested spIsoNet in Relion4 on my workstation (the previous one run on a cluster). Here is the error. Please take a look. The following warnings were encountered upon command-line parsing: 04-16 15:06:31, INFO voxel_size 1.399999976158142 [rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out. [rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out. Traceback (most recent call last):
|
Hi, are the above two errors the same? I got one from a cluster and another one from the workstation. Any suggestions? Thanks! |
Hi, I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem is RELION reconstruction not properly performed. This will happen as spIsoNet does not works for RELION5. I still does not have any understanding how the NCCL related error happens. |
Again I want to confirm whether the Anisotropy Correction ("spisonet.py reconstruct") gives you these errors within the same environment. |
Please also see whether this "#2" is related |
I can confirm that the Anisotropy Correction worked without any error on my 2-GPU workstation which has the same type of GPU as my 4-GPU workstation where I got the second error above. |
Is relion 5 compatibility on the roadmap? Or for now, would you recommend to set up a separate installation of relion 4 for misalignment correction? |
Hi, are you sure that it does not work with RELION-5? Is this a matter of spIsoNet not running at all in RELION-5, or not giving the intended output? Because I have been running it through RELION-5 for initial testing before seeing this comment and it doesn't appear to have any issues, but I can't speak for whether it is producing "correct" results. |
In our hands with relion 5 it seems to run, but the EDIT: EDIT2: Scratch that, I don't think it is actually doing anything. Here is the log:
It seemingly runs and reconstructs, but never trains a model... EDIT 3: Nope, it is working - it just hadn't reached fine enough angular sampling. Working now. One thing I notice though - it defaults to using all GPUs - it would be better if somehow it could default to using the GPUs that have been assigned to this job in Relion (not sure if that is possible?) |
Hi All, I used the recommended option 1 (as the following) in the tutorial to do the installation of spIsoNet, why I could not see the bin directory (which should contain the program spisonet.py) created just under the directory of spIsoNet after the installation? git clone https://github.com/IsoNet-cryoET/spIsoNet.git Or just copy all the files in ~/spIsoNet/spIsoNet/bin/*.py to ~/spIsoNet/bin? |
Hi, The correct path is actually ~/spIsoNet/spIsoNet/bin/spisonet.py. All codes reside within the ~/spIsoNet/spIsoNet directory, so there's no need to move any files.
|
Thank you for your clarification! |
I found that the first error above {RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable} was related to the GPU node's setting for Compute Mode of Default (e.g. on our cluster: #SBATCH --gpu_cmode=shared), and the second error above (Some NCCL operations have failed or timed out) was corrected by this setting (export NCCL_P2P_DISABLE=1). |
Thank you for trouble shooting and report back |
I wonder in this file spIsoNet_v1.0_Tutorial.pdf, whether one more step (pip install .) in Option 3 for the Installation should be added as the last step? Please confirm. |
Hi Yun-Tao,
I got the following errors when I was using spIsoNet as external_reconstruct in Relion4 or Relion5-beta. Any suggestions? Thanks!
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-15 19:26:44, INFO voxel_size 1.399999976158142
04-15 19:26:45, INFO voxel_size 1.399999976158142
04-15 19:39:35, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-15 19:39:35, INFO calculating fast 3DFSC, this will take few minutes
04-15 19:43:02, INFO voxel_size 1.399999976158142
04-15 19:51:23, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-15 19:51:27, INFO voxel_size 1.399999976158142
04-15 19:51:32, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-15 19:54:34, INFO Start preparing subvolumes!
04-15 19:54:59, INFO Done preparing subvolumes!
04-15 19:54:59, INFO Start training!
04-15 19:55:02, INFO Port number: 45495
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in
sys.exit(main())
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
fn(i, *args)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 50, in ddp_train
model = model.cuda()
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with
TORCH_USE_CUDA_DSA
to enable device-side assertions.Traceback (most recent call last):
File "/home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in
with mrcfile.open(mrc1_cor) as d1:
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in init
self._open_file(name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /home/groups/kornberg/donghuac/relion/src/backprojector.cpp, line 1323
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4e6a59]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x46434e]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1b02) [0x5222b2]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x3e9) [0x523279]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(main+0x55) [0x4d47b5]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff6f5fbb555]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x4d805e]
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
The text was updated successfully, but these errors were encountered: