Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants #1353
Comments
I've changed the title, because it might have seemed that I was purposefully trying to run a wavefront larger than the wavefront limit, which was not the case. This issue is only related to requesting so many SPP that the computation is internally split into multiple passes, which causes the aforementioned 16-32 GB peak memory usage, in both LLVM and CUDA, out of the blue. To add motivation: fixing this issue would open up many options to massively speed up rendering. It turns out that splitting the workload into multiple smaller passes and forcibly evaluating the spectrum contribution from the integrator speeds up the computation by a factor of 2 to 4. The memory cost stays manageable, proportional to number of pixels * SPP per pass, which is only around 800 MB with RGB variants, a 1 Mpx image and 64 SPP per pass.
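The 800 MB figure above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that the per-pass cost is dominated by one RGB value of three 32-bit floats per sample, and the helper names `per_pass_bytes` and `print_estimate` are hypothetical, not part of Mitsuba or Dr.Jit.

```cpp
#include <cstdint>
#include <cstdio>

// Bytes needed to hold `channels` values for every sample of one pass.
// Assumption: nothing else (sampler state, AOVs, weights) is counted.
constexpr std::uint64_t per_pass_bytes(std::uint64_t width,
                                       std::uint64_t height,
                                       std::uint64_t spp_per_pass,
                                       std::uint64_t channels,
                                       std::uint64_t bytes_per_value) {
    return width * height * spp_per_pass * channels * bytes_per_value;
}

// Hypothetical helper: print the estimate for the case quoted in the comment
// (1024x1024 pixels, 64 SPP per pass, 3 channels, 4 bytes per channel).
inline void print_estimate() {
    constexpr std::uint64_t bytes = per_pass_bytes(1024, 1024, 64, 3, 4);
    std::printf("%.1f MB\n", static_cast<double>(bytes) / 1e6);
}
```

Under these assumptions the estimate is 805,306,368 bytes, which agrees with the "around 800 MB" quoted above.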
Hi @xacond00, I believe this is expected behaviour. There was a discussion here that similarly covered what you've encountered, and the answer is still relevant.
So while your solution in edit 3 may be fine for an independent sampler, more generally it may not be applicable.
Yes, I know that. It's really unfortunate: because of this single quirk, the software in its default configuration (without using deprecated options) basically won't run on anything less than a professional-grade GPU, in some cases not even on a 4090. By the way, in the current state this applies to all samplers, not just non-independent ones, as you incorrectly changed in the title.
Hello!
If you exclusively use the independent sampler, you can use this workaround in the SamplingIntegrator:

    // Potentially render multiple passes
    for (size_t i = 0; i < n_passes; i++) {
        // Fork a fresh sampler for each pass and give it a distinct seed
        auto sampler = sensor->sampler()->fork();
        sampler->seed(i * 512, wavefront_size);
        render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                      diff_scale_factor);
        if (n_passes > 1) {
            // sampler->advance(); // Will trigger a kernel launch of size 1
            // sampler->schedule_state();
            dr::eval(block->tensor());
        }
    }

I.e., instead of advancing the sampler, just fork it with a new seed. When rendering a different SPP per pass, you might also have to set the wavefront size and SPP per pass on each fork, for example by moving the sampler setup code into the loop.
Description
Trying to render any scene with a sampling integrator from within the compiled C++ binary causes an out-of-memory error or a crash when the wavefront size limit is surpassed in any way.
E.g. the Cornell box scene at 1024x1024 with 4095 spp (renders completely fine):
4096 spp (out of memory):
8192 spp (throws error):
The peak memory usage gets about 1000x higher for no apparent reason, even though Mitsuba reports that the rendering was split into multiple passes. I thought that should lower the memory usage instead of increasing it?
I'm seeing the same thing when I arbitrarily split the rendering into smaller passes within the sampling integrator's code itself (e.g. 4*64 vs. the full 256): the memory usage drastically increases instead of decreasing.
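The 4095 / 4096 spp boundary reported above is consistent with a wavefront limit near 2^32 lanes. The sketch below assumes the limit is the largest 32-bit unsigned value (an assumption inferred from the reported numbers, not taken from the Mitsuba source), and the names `wavefront_size`, `exceeds_limit` and `min_passes` are hypothetical. `min_passes` is only a lower bound on the number of passes, not necessarily the count Mitsuba chooses.

```cpp
#include <cstdint>

// Assumed limit: largest value representable by a 32-bit unsigned index.
constexpr std::uint64_t kAssumedWavefrontLimit = 0xffffffffull;

// Number of lanes (one per pixel per sample) for a single-pass render.
constexpr std::uint64_t wavefront_size(std::uint64_t width,
                                       std::uint64_t height,
                                       std::uint64_t spp) {
    return width * height * spp;
}

// True if a single pass would exceed the assumed limit.
constexpr bool exceeds_limit(std::uint64_t width, std::uint64_t height,
                             std::uint64_t spp) {
    return wavefront_size(width, height, spp) > kAssumedWavefrontLimit;
}

// Lower bound on the number of passes: ceiling of lanes / limit.
constexpr std::uint64_t min_passes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t spp) {
    return (wavefront_size(width, height, spp) + kAssumedWavefrontLimit - 1) /
           kAssumedWavefrontLimit;
}
```

With these assumptions, 1024x1024 at 4095 spp stays just below the limit (4,293,918,720 lanes), while 4096 spp lands exactly on 2^32 lanes and needs at least two passes.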
Steps to reproduce
master branch
Edit 1:
Here is the memory usage when I split 1024 spp into 8 passes * 128 spp:
And here I tried removing
sampler->schedule_state()
inside the multi-pass loop, still with 8 passes:
Although the rendering time roughly doubled...
Is this the expected behavior?
Edit 2:
I thought about it, and it seems to me that Dr.Jit tries to evaluate the whole sampler state at once, which works out to those 16 GB at the maximum wavefront size. This doesn't happen when n_passes == 1, because of the condition in the loop.
It still feels like a huge oversight that should definitely be fixed.
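The 16 GB estimate can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes one evaluated array with a single value per lane at roughly 2^32 lanes; it is not a measurement of what Dr.Jit actually allocates, and `array_bytes` is a hypothetical helper.

```cpp
#include <cstdint>

// Bytes occupied by one fully evaluated array with one value per lane.
constexpr std::uint64_t array_bytes(std::uint64_t lanes,
                                    std::uint64_t bytes_per_lane) {
    return lanes * bytes_per_lane;
}

// Assumed lane count: 2^32, i.e. the maximum wavefront size discussed above.
constexpr std::uint64_t kLanes = 1ull << 32;

// One 32-bit value per lane: 16 GiB. One 64-bit value per lane: 32 GiB.
constexpr std::uint64_t kBytes32 = array_bytes(kLanes, 4);
constexpr std::uint64_t kBytes64 = array_bytes(kLanes, 8);
```

Under these assumptions a single 32-bit array comes to 16 GiB and a single 64-bit array to 32 GiB, which brackets the 16-32 GB peak reported in this issue.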
Edit 3:
Just repeatedly forking the original sampler is as fast as the original, yet only uses 16 MB of memory (the image looks unbiased, because of the seed).
Surely this has to be a bug? Evaluating the whole image block doesn't cause the memory usage to spike, but evaluating a bunch of UInt32 states in a sampler does. By the way, the stratified sampler shows the same behavior, and I would expect the others to behave the same.