Failures of anisotropy correction of half-maps with tutorial data #17

Open
suqi-z opened this issue Nov 20, 2024 · 4 comments

Comments


suqi-z commented Nov 20, 2024

The data and command are the same as in the tutorial, but the Anisotropy Correction of half-maps step fails with the following error:

spisonet.py reconstruct emd_8731_half_map_1.mrc emd_8731_half_map_2.mrc --aniso_file FSC3D.mrc --mask emd_8731_msk_1.mrc --limit_res 3.5 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 2
11-20 00:52:42, INFO voxel_size 1.309999942779541
11-20 00:52:43, INFO spIsoNet correction until resolution 3.5A!
Information beyond 3.5A remains unchanged
11-20 00:52:57, INFO Start preparing subvolumes!
11-20 00:53:06, INFO Done preparing subvolumes!
11-20 00:53:06, INFO Start training!
11-20 00:53:09, INFO Port number: 51405
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
0%| | 0/125 [00:00<?, ?batch/s][rank1]:W1120 00:53:26.205000 139648681686848 torch/_logging/_internal.py:1034] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:W1120 00:53:26.225000 140587963545408 torch/_logging/_internal.py:1034] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:W1120 00:53:26.263000 140600396216128 torch/_logging/_internal.py:1034] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:W1120 00:53:26.357000 139692869912384 torch/_logging/_internal.py:1034] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/tmp/tmpb9ffwjno/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpl5yntb4i/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmph2p9sytq/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpbvg8egds/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmph4ckum7q/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpmj2mj6b6/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpowqgpc9/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp_vj7apqe/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmptthkhvx/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpjo5sie2e/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpqvdup2d8/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp8xzss5bc/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
0%| | 0/125 [00:08<?, ?batch/s]
W1120 00:53:33.367000 139899853596480 torch/multiprocessing/spawn.py:146] Terminating process 40535 via signal SIGTERM
W1120 00:53:33.367000 139899853596480 torch/multiprocessing/spawn.py:146] Terminating process 47366 via signal SIGTERM
W1120 00:53:33.368000 139899853596480 torch/multiprocessing/spawn.py:146] Terminating process 47503 via signal SIGTERM
Traceback (most recent call last):
File "/spshared/apps/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
sys.exit(main())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
while not context.join():
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 116, in ddp_train
preds = model(x1)# + noise.cuda())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 38, in inner
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 97, in forward
x, down_sampling_features = self.encoder(x)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 98, in torch_dynamo_resume_in_forward_at_97
x = self.decoder(x, down_sampling_features)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1110, in __call__
return hijacked_callback(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 948, in __call__
result = self._inner_convert(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 472, in __call__
return _compile(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_utils_internal.py", line 84, in wrapper_function
return StrobelightCompileTimeProfiler.profile_compile_time(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_strobelight/compile_time_profiler.py", line 129, in profile_compile_time
return func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 817, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 636, in compile_inner
out_code = transform_code_object(code, transform)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1185, in transform_code_object
transformations(instructions, code_options)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 178, in _fn
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 582, in transform
tracer.run()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2451, in run
super().run()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 893, in run
while self.step():
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 805, in step
self.dispatch_table[inst.opcode](self, inst)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2642, in RETURN_VALUE
self._return(inst)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2627, in _return
self.output.compile_subgraph(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1098, in compile_subgraph
self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1318, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1409, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1390, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 565, in compile_fn
return self.backend_compile_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/__init__.py", line 1951, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1505, in compile_fx
return aot_autograd(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 69, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 954, in aot_module_simplified
compiled_fn, _ = create_aot_dispatcher_function(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 687, in create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 461, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1410, in fw_compiler_base
return inner_compile(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 84, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/debug.py", line 304, in inner
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 527, in compile_fx_inner
compiled_graph = fx_codegen_and_compile(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 831, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1751, in compile_to_fn
return self.compile_to_module().call
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1680, in compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1640, in codegen
self.scheduler.codegen()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 231, in time_wrapper
r = func(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 2741, in codegen
self.get_backend(device).codegen_node(node)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 69, in codegen_node
return self._triton_scheduling.codegen_node(node)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1148, in codegen_node
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1317, in codegen_node_schedule
src_code = kernel.codegen_kernel()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 2159, in codegen_kernel
**self.inductor_meta_common(),
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 2047, in inductor_meta_common
"backend_hash": torch.utils._triton.triton_hash_with_backend(),
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/_triton.py", line 63, in triton_hash_with_backend
backend = triton_backend()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/_triton.py", line 49, in triton_backend
target = driver.active.get_current_target()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmptthkhvx/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmptthkhvx/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-I/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmptthkhvx', '-I/spshared/apps/miniconda3/envs/spisonet/include/python3.10']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

Thanks for your help!
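The hint at the end of the traceback above (falling back to eager mode) can be sketched as follows. This is a minimal illustration, not a confirmed fix for spIsoNet: `enable_eager_fallback` is a hypothetical helper name, and the only PyTorch setting it touches is the `torch._dynamo.config.suppress_errors` flag quoted in the error message. Because training is launched with `mp.spawn`, the flag would presumably have to be set inside each worker process (for example at the top of the spawned function) rather than only in the parent.

```python
def enable_eager_fallback() -> bool:
    """Ask TorchDynamo to fall back to eager mode when a backend fails to compile.

    Hypothetical helper for illustration only.
    Returns True if the flag was set, False if torch._dynamo is not importable.
    """
    try:
        import torch._dynamo
    except ImportError:
        return False
    # Same setting the error message suggests: compile failures are
    # suppressed and the affected code runs uncompiled (eager) instead.
    torch._dynamo.config.suppress_errors = True
    return True


if __name__ == "__main__":
    print("eager fallback enabled:", enable_eager_fallback())
```

This only hides the compile failure and gives up the speed-up from `torch.compile`; the missing `stdatomic.h` header is still the underlying problem.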

Collaborator

procyontao commented Nov 20, 2024 via email

Author

suqi-z commented Nov 22, 2024

Thanks for your reply!

After commenting out these lines, I reran this step, but nothing seems to have changed.

The error output is as follows:

spisonet.py reconstruct emd_8731_half_map_1.mrc emd_8731_half_map_2.mrc --aniso_file FSC3D.mrc --mask emd_8731_msk_1.mrc --limit_res 3.5 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 2
11-21 20:50:41, INFO voxel_size 1.309999942779541
11-21 20:50:52, INFO spIsoNet correction until resolution 3.5A!
Information beyond 3.5A remains unchanged
11-21 20:51:02, INFO Start preparing subvolumes!
11-21 20:51:31, INFO Done preparing subvolumes!
11-21 20:51:31, INFO Start training!
11-21 20:51:34, INFO Port number: 47458
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
0%| | 0/125 [00:00<?, ?batch/s][rank3]:W1121 20:51:49.720000 27636 site-packages/torch/_logging/_internal.py:1081] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:W1121 20:51:49.887000 18250 site-packages/torch/_logging/_internal.py:1081] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank1]:W1121 20:51:49.899000 10667 site-packages/torch/_logging/_internal.py:1081] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:W1121 20:51:49.933000 5556 site-packages/torch/_logging/_internal.py:1081] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/tmp/tmpbqb3z3m1/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpfmgniqil/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpwg638qpb/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp9no3ip6h/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp8083o31x/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmphylv2zp/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp1hpb997j/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmppir027c6/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp4y1k5_9o/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpbkgoo9g4/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmpvdinu5f3/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
/tmp/tmp51ilakio/main.c:6:23: fatal error: stdatomic.h: No such file or directory
#include <stdatomic.h>
^
compilation terminated.
0%| | 0/125 [00:06<?, ?batch/s]
[rank0]:[W1121 20:51:55.475865037 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1121 20:51:55.676000 30273 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 5556 via signal SIGTERM
W1121 20:51:55.677000 30273 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 10667 via signal SIGTERM
W1121 20:51:55.678000 30273 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 18250 via signal SIGTERM
Traceback (most recent call last):
File "/spshared/apps/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
sys.exit(main())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 506, in compile_fn
return self.backend_compile_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/__init__.py", line 2234, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
return aot_autograd(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
compiled_fn = dispatch_and_compile()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 588, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
return _fw_compiler_base(model, example_inputs, is_inference)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
return inner_compile(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1334, in load
compiled_graph = compile_fx_fn(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1913, in compile_to_fn
return self.compile_to_module().call
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1839, in compile_to_module
return self._compile_to_module()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1845, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1784, in codegen
self.scheduler.codegen()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 3383, in codegen
return self._codegen()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/scheduler.py", line 3461, in _codegen
self.get_backend(device).codegen_node(node)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 80, in codegen_node
return self._triton_scheduling.codegen_node(node)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1155, in codegen_node
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/simd.py", line 1364, in codegen_node_schedule
src_code = kernel.codegen_kernel()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 2661, in codegen_kernel
**self.inductor_meta_common(),
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codegen/triton.py", line 2532, in inductor_meta_common
"backend_hash": torch.utils._triton.triton_hash_with_backend(),
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/_triton.py", line 53, in triton_hash_with_backend
backend = triton_backend()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/_triton.py", line 45, in triton_backend
target = driver.active.get_current_target()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp4y1k5_9o/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp4y1k5_9o/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-I/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp4y1k5_9o', '-I/spshared/apps/miniconda3/envs/spisonet/include/python3.10']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 116, in ddp_train
preds = model(x1)# + noise.cuda())
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 40, in inner
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 97, in forward
x, down_sampling_features = self.encoder(x)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 98, in torch_dynamo_resume_in_forward_at_97
x = self.decoder(x, down_sampling_features)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1263, in __call__
return hijacked_callback(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
result = self._inner_convert(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
return _compile(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
out_code = transform_code_object(code, transform)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
return fn(*args, **kwargs)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
tracer.run()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
super().run()
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
while self.step():
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
self.dispatch_table[inst.opcode](self, inst)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
self._return(inst)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
self.output.compile_subgraph(
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
return self._call_user_compiler(gm)
File "/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp4y1k5_9o/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp4y1k5_9o/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/lib', '-L/lib64', '-I/spshared/apps/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp4y1k5_9o', '-I/spshared/apps/miniconda3/envs/spisonet/include/python3.10']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

Thanks for your help!

@wsatbluesky

I have the same issue. Is there any update?

@procyontao
Collaborator

Hi,

I do not have much insight into this and cannot reproduce the error.

The error says that stdatomic.h cannot be found. This header is part of standard C (C11) and should come with the C compiler toolchain. There are two possibilities: either your system has an outdated C toolchain, or it has the header but the compiler cannot find it. On my Ubuntu system, stdatomic.h is in /usr/lib/gcc/x86_64-linux-gnu/11/include, and I believe it is picked up whenever I compile C code with -std=c11.

I would expect a similar error to occur when building other C extensions. I am not sure, but maybe installing a newer compiler will help?
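
To tell the two possibilities apart, it may help to test the compiler directly, outside of spIsoNet and PyTorch. The following is only a minimal sketch, not part of spIsoNet: the helper name `can_compile_stdatomic` is made up for illustration, and the script simply asks a compiler to parse a tiny C file that includes stdatomic.h. Checking the `CC` environment variable first is an assumption about how Triton picks its compiler and should be verified against the installed triton/runtime/build.py.

```python
import os
import shutil
import subprocess
import tempfile

# Tiny C translation unit that can only be parsed if <stdatomic.h> is found.
C_SOURCE = """
#include <stdatomic.h>
int main(void) { atomic_int x = 0; atomic_fetch_add(&x, 1); return 0; }
"""


def can_compile_stdatomic(cc):
    """Hypothetical helper: return (ok, stderr_text) for a test compile with `cc`."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "check.c")
        with open(src, "w") as f:
            f.write(C_SOURCE)
        try:
            # -fsyntax-only skips code generation and linking, so this
            # exercises header lookup and parsing only.
            result = subprocess.run(
                [cc, "-fsyntax-only", src],
                capture_output=True,
                text=True,
            )
        except OSError as exc:
            # The compiler binary itself is missing or not executable.
            return False, str(exc)
        return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # Assumption: CC, if set, names the compiler to test; otherwise use PATH.
    cc = os.environ.get("CC") or shutil.which("gcc") or shutil.which("clang")
    if cc is None:
        print("No C compiler found on PATH")
    else:
        ok, err = can_compile_stdatomic(cc)
        print(cc, "can include stdatomic.h" if ok else "FAILED")
        if not ok:
            print(err)
```

If the check fails for /usr/bin/gcc (the compiler named in the traceback) but passes for a newer compiler installed elsewhere, pointing the build at that compiler is one thing to try. Running `gcc --version` alongside this check shows whether the system compiler is simply too old.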
