
Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants #1353

Open
xacond00 opened this issue Oct 22, 2024 · 5 comments

Comments


xacond00 commented Oct 22, 2024

Description

Trying to render any scene with a sampling integrator from the compiled C++ binary causes an out-of-memory error or a crash whenever the wavefront size limit is exceeded in any way.
E.g. the Cornell box scene at 1024x1024 with 4095 spp renders completely fine:
[screenshot: memory usage at 4095 spp]
4096 spp runs out of memory:
[screenshot: memory usage at 4096 spp]
8192 spp throws an error:
[screenshot: error message at 8192 spp]

The peak memory usage becomes roughly 1000x higher for no apparent reason, even though Mitsuba reports that the rendering was split into multiple passes. I thought splitting into passes should lower memory usage, not increase it?

I also see the same thing when manually splitting the rendering into smaller passes within the sampling integrator's code itself (e.g. 4 passes of 64 spp vs. a single pass of 256 spp): the memory usage drastically increases instead of decreasing.

Steps to reproduce

master branch

  1. Compile the C++ binary.
  2. Run the llvm / cuda rgb variants on any scene with an independent sampler where spp * film_size exceeds the wavefront size limit (0xffffffff); see the sketch below.
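
For reference, here is a minimal back-of-the-envelope sketch (standalone C++; the 0xffffffff limit and the 1024x1024 / 4096 spp numbers are taken from the description above) of when the renderer is forced into multiple passes:

// Standalone sketch: how many passes a given film size / spp combination needs,
// assuming the per-pass wavefront is capped at 0xffffffff samples.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t film_size = 1024ull * 1024ull;    // 1024x1024 pixels
    const uint64_t spp       = 4096;                 // samples per pixel
    const uint64_t limit     = 0xffffffffull;        // wavefront size limit

    uint64_t total    = film_size * spp;             // 2^32 samples, just over the limit
    uint64_t n_passes = (total + limit - 1) / limit; // ceiling division

    std::printf("total samples: %llu, passes: %llu\n",
                (unsigned long long) total, (unsigned long long) n_passes);
    return 0;
}

At 4095 spp the total stays below the limit (a single pass); at 4096 spp it crosses it, n_passes becomes 2, and that is exactly where the memory blow-up starts.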

Edit 1:

Here is the memory usage when I split 1024 spp into 8 passes of 128 spp each:
[screenshot: memory usage with 8 passes]
And here I tried removing sampler->schedule_state() inside the multi-pass loop, still with 8 passes:
[screenshot: memory usage with schedule_state() removed]
Although the rendering time roughly doubled.

Is this the expected behavior?

Edit 2:

I thought about it, and it seems to me that Dr.Jit tries to evaluate the whole sampler state at once, which works out to those 16 GB at the maximum wavefront size. This doesn't happen when n_passes == 1, because of the condition below.

if (n_passes > 1) {
    sampler->advance();        // Will trigger a kernel launch of size 1
    sampler->schedule_state();
    dr::eval(block->tensor());
}

But it still feels like a huge oversight and should definitely be fixed.

Edit 3:

Just repeatedly forking the original sampler is as fast as the original, yet only uses 16 MB of memory (the image looks unbiased, because each pass gets a different seed):

// Potentially render multiple passes
for (size_t i = 0; i < n_passes; i++) {
    auto sampler = sensor->sampler()->fork();
    sampler->seed(i * 512, wavefront_size);
    render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                  diff_scale_factor);

    if (n_passes > 1) {
        // sampler->advance();        // Will trigger a kernel launch of size 1
        // sampler->schedule_state();
        dr::eval(block->tensor());
    }
}

This definitely has to be a bug? Evaluating the whole image block doesn't cause memory usage to spike, but evaluating a bunch of UInt32 states in a sampler does? Btw. I see the same behavior with the stratified sampler, and I would expect the other samplers to behave the same way.
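
To put rough numbers on that difference (a standalone sketch; the 4-channel float block and the two 8-byte values of per-lane sampler state are assumptions based on the rgb variants and a PCG32-style independent sampler):

// Standalone sketch: compare the size of the evaluated image block with the
// size of a fully materialized per-lane sampler state at maximum wavefront size.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t width = 1024, height = 1024, channels = 4;
    const uint64_t block_bytes = width * height * channels * sizeof(float); // ~16 MB

    const uint64_t wavefront   = 0xffffffffull;      // maximum wavefront size
    const uint64_t state_bytes = 2 * 8 * wavefront;  // two 8-byte values per lane

    std::printf("image block: %.1f MB, sampler state: %.1f GB\n",
                block_bytes / 1e6, state_bytes / 1e9);
    return 0;
}

The image block stays in the tens of megabytes regardless of spp, while the sampler state scales with the wavefront size, which would explain why evaluating the one is harmless and evaluating the other is not.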

@xacond00 xacond00 changed the title from "Sampling integrator: extreme peak memory usage + crash, once you surpass wavefront size limit in any way" to "Sampling integrator: extreme peak memory usage (> 16 GB), once the number of passes > 1 in both vectorized variants" on Oct 24, 2024

xacond00 commented Oct 24, 2024

I've changed the title, because it might have seemed that I purposefully tried to run a larger wavefront than the wavefront limit, which was not the case. This issue is only about having so many SPP that the computation is internally split into multiple passes, which causes the aforementioned 16-32 GB peak memory usage, in both LLVM and CUDA, out of the blue.

To add some motivation: fixing this issue would open up many options to massively speed up rendering. It turns out that splitting the workload into multiple smaller passes and forcibly evaluating the spectrum contribution from the integrator speeds up the computation by a factor of 2-4, with a manageable memory cost proportional to the number of pixels times the SPP per pass (around 800 MB for the rgb variants with a 1 Mpx image and 64 SPP per pass).

rtabbara (Contributor) commented

Hi @xacond00,

I believe this is expected behaviour. There was a discussion here that similarly covered what you've encountered, and the answer is still relevant:

Had a quick look at it this morning. It's a bit unfortunate, but we need to evaluate and store the sampler's state between each pass, just in case it's some stratified sampler. That state can be surprisingly large; for example, in an independent sampler it is represented by two 8-byte values per lane/thread:
2 * 8 * (2 ** 32 - 1) bytes ≈ 68 GB

We should maybe have a special code path for the independent sampler as it really doesn't need to store its state between passes.

So while your solution in edit 3 may be fine for an independent sampler, more generally it may not be applicable.
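
A rough sketch of what such a special code path might look like, assuming a hypothetical Sampler::is_stateless() query (not part of the current API) inside the existing multi-pass loop:

// Hypothetical sketch, not the actual implementation: only keep the sampler
// state alive between passes for samplers that actually carry per-lane state.
// is_stateless() is an assumed method that would e.g. return true for the
// independent sampler and false for the stratified one.
if (n_passes > 1) {
    sampler->advance();
    if (!sampler->is_stateless())
        sampler->schedule_state();
    dr::eval(block->tensor());
}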

@merlinND merlinND changed the title from "Sampling integrator: extreme peak memory usage (> 16 GB), once the number of passes > 1 in both vectorized variants" to "Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants with non-independent sampler" on Oct 28, 2024

xacond00 commented Oct 29, 2024

> So while your solution in edit 3 may be fine for an independent sampler, more generally it may not be applicable.

Yes, I know that.
But why don't the first pass and the other computations cause any significant VRAM usage in that case?

It's really unfortunate: because of this single quirk, the software in its default configuration (without using deprecated options) basically won't run on anything less than professional-grade GPUs, and in some cases not even on a 4090.
Would host-side caching on Linux be possible at all? Or, at the very least, could the available memory be queried to break down the spp per pass automatically?
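
To illustrate the memory-query idea (a standalone sketch for the CUDA backend only; cudaMemGetInfo is a real CUDA runtime call, but the per-sample byte estimate and the headroom factor are made-up assumptions):

// Standalone sketch: derive an spp-per-pass budget from the free device memory.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }

    const uint64_t film_size        = 1024ull * 1024ull;             // pixels
    const uint64_t bytes_per_sample = 16;                            // assumed state cost per sample
    const uint64_t budget           = (uint64_t) (free_bytes * 0.8); // keep 20% headroom

    const uint64_t spp_per_pass = budget / (film_size * bytes_per_sample);
    std::printf("free: %.1f GB -> spp per pass: %llu\n",
                free_bytes / 1e9, (unsigned long long) spp_per_pass);
    return 0;
}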

Btw. in the current state this applies to all samplers, not just non-independent ones, so the changed title is not accurate.

@merlinND merlinND changed the title from "Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants with non-independent sampler" to "Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants" on Oct 30, 2024
Angom8 commented Nov 15, 2024

Hello!
I had this specific issue when I was handling adaptive sampling (which required multiple passes with a non-uniform SPP per pass). I worked on professional-grade GPUs, but it was still a significant issue. I think it might not be possible to reduce the consumption because of Dr.Jit's current implementation / loops.
I would like to help, or to hear about this issue again if a fix is being worked on.


xacond00 commented Nov 19, 2024

> Hello! I had this specific issue when I was handling adaptive sampling (which required multiple passes with a non-uniform SPP per pass). I worked on professional-grade GPUs, but it was still a significant issue. I think it might not be possible to reduce the consumption because of Dr.Jit's current implementation / loops. I would like to help, or to hear about this issue again if a fix is being worked on.

If you exclusively use the independent sampler, you can use this workaround in the SamplingIntegrator:

// Potentially render multiple passes
for (size_t i = 0; i < n_passes; i++) {
    auto sampler = sensor->sampler()->fork();
    sampler->seed(i * 512, wavefront_size);
    render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                  diff_scale_factor);

    if (n_passes > 1) {
        // sampler->advance();        // Will trigger a kernel launch of size 1
        // sampler->schedule_state();
        dr::eval(block->tensor());
    }
}

I.e. instead of advancing the sampler, just fork it with a new seed. When rendering a different SPP per pass, you might also have to set the wavefront size and SPP per pass in each fork, i.e. move the sampler setup code into the loop.
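
For the varying-SPP-per-pass case, the setup moved into the loop could look roughly like this (a sketch only: spp_per_pass, base_seed and film_size are placeholder names, and set_sample_count() / set_samples_per_wavefront() are assumed to be available on the Sampler):

// Sketch: per-pass sampler setup with a potentially different SPP in each pass.
for (size_t i = 0; i < n_passes; i++) {
    uint32_t spp_i          = spp_per_pass[i];             // e.g. from an adaptive criterion
    uint32_t wavefront_size = film_size * spp_i;

    auto sampler = sensor->sampler()->fork();
    sampler->set_sample_count(spp_i);
    sampler->set_samples_per_wavefront(spp_i);
    sampler->seed(base_seed + i, wavefront_size);          // fresh seed per pass

    render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                  diff_scale_factor);
    dr::eval(block->tensor());                             // flush the pass
}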
