I can successfully train on a single GPU with a batch size of 4, but am unable to train on 4 GPUs with a batch size of 16.
I get the following error message:
```
Lock file exists in build directory: '/gpfs/u/home/~/.cache/torch_extensions/nvdiffrast_plugin/lock'
tick 0     kimg 0.0      time 27m 55s      sec/tick 1665.6   sec/kimg 104099.05   maintenance 9.2
==> start visualization
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
```
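For reference, the failure happens inside the multi-process launch path visible in the traceback (`train_3d.py`, line 107). Below is a minimal sketch of that `torch.multiprocessing.spawn` pattern, not the actual GET3D code; `subprocess_fn`, the config argument, and the temp dir are placeholders:

```python
import torch.multiprocessing as mp

def subprocess_fn(rank, num_gpus, temp_dir):
    # Each spawned worker would normally init its process group and start training.
    # In GET3D, workers also JIT-compile the nvdiffrast CUDA plugin on first use,
    # which is what creates the 'lock' file in the torch_extensions build directory.
    print(f'worker {rank} of {num_gpus} starting (temp dir: {temp_dir})')

if __name__ == '__main__':
    num_gpus = 4
    # spawn() starts one process per GPU; if any worker dies (here with SIGBUS),
    # the parent raises ProcessExitedException, as shown in the traceback above.
    mp.spawn(fn=subprocess_fn, args=(num_gpus, '/tmp/get3d_init'), nprocs=num_gpus)
```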