OOM issue and blank validation results using mcmc strategy #487

Open
LaFeuilleMorte opened this issue Nov 12, 2024 · 1 comment
LaFeuilleMorte commented Nov 12, 2024

Hi, I've tried the mcmc strategy on a smaller subset (about 542 images, ~150,000 initial points), and it works quite well. But when I use the whole dataset (about 973 images, ~360,000 initial points), it raises a CUDA OOM error after about 6400 steps. And when I tried lowering cap_max to 500_000 for each GPU, the validation results were blank images. I tested with the default strategy and it works fine.

My command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
python examples/simple_trainer.py mcmc \
    --data_dir {My_DATASET_DIR} \
    --data_factor 1 \
    --result_dir ./results/{MY_OUTPUT_DIR} \
    --max_steps 50_000 \
    --eval_steps 7_000 30_000 40_000 50_000 \
    --save_steps 7_000 30_000 40_000 50_000 \
    --use_bilateral_grid
My log:

Step 6200: Relocated 934401 GSs.
Step 6200: Added 46996 GSs. Now having 986928 GSs.
Step 6300: Relocated 984383 GSs.
Step 6300: Added 13072 GSs. Now having 1000000 GSs.
Step 6400: Relocated 995335 GSs.
Step 6400: Added 0 GSs. Now having 1000000 GSs.

Traceback (most recent call last):
  File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1076, in <module>
    cli(main, cfg, verbose=True)
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 344, in cli
    process_context.join()
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/distributed.py", line 295, in _distributed_worker
    fn(local_rank, world_rank, world_size, args)
  File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 1021, in main
    runner.train()
  File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 589, in train
    renders, alphas, info = self.rasterize_splats(
  File "/aistudio/workspace/aigc/wangqihang013/aigc3d/repos/neural_rendering/high_quality/gsplat/examples/simple_trainer.py", line 469, in rasterize_splats
    render_colors, render_alphas, info = rasterization(
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/rendering.py", line 497, in rasterization
    tiles_per_gauss, isect_ids, flatten_ids = isect_tiles(
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 382, in isect_tiles
    tiles_per_gauss, isect_ids, flatten_ids = _make_lazy_cuda_func("isect_tiles")(
  File "/aistudio/workspace/system-default/envs/gsplat/lib/python3.10/site-packages/gsplat/cuda/_wrapper.py", line 14, in call_cuda
    return getattr(_C, name)(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.24 GiB. GPU 0 has a total capacity of 39.42 GiB of which 3.14 GiB is free. Process 126936 has 36.29 GiB memory in use. Of the allocated memory 30.47 GiB is allocated by PyTorch, and 2.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
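As an aside, the last line of the error suggests the allocator option max_split_size_mb. A minimal sketch of applying it is shown below, assuming it is set before the first CUDA allocation; this only mitigates fragmentation and does not address the Gaussian growth identified in the follow-up comment (128 is an illustrative value, not a recommendation from this issue):

```python
# Sketch only: fragmentation workaround named in the error message, not a fix
# for the root cause. The value 128 MiB is illustrative.
import os

# Must be set before the CUDA caching allocator is first used (i.e. before any
# GPU allocation), otherwise it has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  -- imported after configuring the allocator
```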

LaFeuilleMorte (Author) commented:
Alright, I've figured out why this happens. My dataset is about twice as large as the previous one, and its point cloud contains a lot of noise. As a result, the mcmc strategy under the current config produces Gaussians with very large scales, which prevents the Gaussians from fitting the scene. And according to this issue:

#464 (comment)

large-scale Gaussians cause the above error in isect_tiles.

So I used a larger scale_reg (scale_reg=0.05), and the problem seems to have disappeared. But I'm not sure whether I've chosen the optimal scale_reg coefficient.
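For context, here is a minimal sketch of the kind of scale regularization that scale_reg controls, assuming log-parameterized per-axis scales as in the gsplat example trainer; the exact expression in simple_trainer.py may differ slightly. A larger coefficient penalizes large Gaussians more strongly, which is why raising it helps here:

```python
import torch

def scale_reg_loss(log_scales: torch.Tensor, scale_reg: float = 0.05) -> torch.Tensor:
    """L1-style penalty on Gaussian scales (sketch, assuming log-parameterized scales).

    log_scales: [N, 3] per-axis log scales of the Gaussians.
    scale_reg:  regularization coefficient (0.05 is the value reported above;
                the trainer's default may differ and the optimum is scene-dependent).
    """
    # exp() maps the optimized log-scales back to metric scales; penalizing their
    # mean absolute value discourages the very large Gaussians that blow up the
    # tile-intersection buffers allocated by isect_tiles.
    return scale_reg * torch.exp(log_scales).abs().mean()

# Added to the photometric loss each step, e.g.:
# loss = photometric_loss + scale_reg_loss(splats["scales"], scale_reg=0.05)
```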
