
Batch generation with batch_dims in v0.2.0 is about 10-20 times slower #121

Closed
dhruvbalwada opened this issue Nov 1, 2022 · 1 comment · Fixed by #132
Labels
bug Something isn't working

Comments

@dhruvbalwada

What is your issue?

I have been using xbatcher for a few weeks now, and happened to be around when version 0.1.0 switched to version 0.2.0. I noticed a massive slowdown and increase in memory usage in my workflow after the upgrade took place.

I have identified that the problem seems to occur when using the batch_dims argument.

The minimal code example given below takes about 6 s to cycle through all the batches in the for-loop when using xbatcher 0.1.0, but takes about 1 min 35 s to do the same with xbatcher 0.2.0.

Along with the slowdown, memory usage also shoots up massively when using version 0.2.0 (often leading to a kernel crash when using a small cluster). Just watching the memory-used indicator in JupyterLab, I see memory usage hover around 2.3 GB constantly for 0.1.0, but cycle up to 16 GB for 0.2.0.

Minimal example code:

import xarray as xr
import numpy as np
import xbatcher

da = xr.DataArray(np.ones((100, 200, 50, 200)), dims=['X', 'Y', 'Z', 'time']).rename('da')

# batch_dims option
bgen = xbatcher.BatchGenerator(ds=da, input_dims={}, batch_dims={'time': 10})

%%time
nbat = 0
for batch in bgen:
    print(nbat)
    nbat += 1
@weiji14 weiji14 added the bug Something isn't working label Nov 1, 2022
@maxrjones maxrjones mentioned this issue Nov 2, 2022
@maxrjones
Member

maxrjones commented Nov 2, 2022

Thanks for the bug report and MVCE @dhruvbalwada. I don't have a patch yet, but have an explanation and will work to get this fixed promptly. I expect we'll make a patch release afterwards.

In v0.1.0, the batches were generated lazily, but many features were missing (e.g., __getitem__ and __len__, in addition to the data loaders). In order to allow arbitrary batch access, #25 made batch generation eager. #112 generally improved performance by restoring lazy batch generation alongside the features added since v0.1.0. The strategy for lazy batch generation is to store the indices of the batches rather than the xarray data objects themselves. Your bug report reveals that only the indices associated with dimensions in input_dims are appropriately represented in the selector object, but not the indices of dimensions in batch_dims. The consequence is that batch_dims only affects the number of batches returned, not the size of the DataArray returned as a batch.
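The index-storing strategy described above can be sketched roughly like this. This is a minimal illustration, not xbatcher's actual implementation; the function name make_selectors is hypothetical, and the real selector object is more involved. The key point is that each batch is kept as a dict of slices and only materialized on demand; the v0.2.0 bug corresponds to dropping the batch_dims slices when building these selectors:

```python
import itertools

import numpy as np
import xarray as xr


def make_selectors(ds, input_dims, batch_dims):
    """Hypothetical sketch: build slice-based selectors that cover
    both input_dims and batch_dims, storing slices instead of data."""
    # Merge both dim specs; the v0.2.0 regression amounts to batch_dims
    # slices being dropped here, so each batch selected the full dimension.
    all_dims = {**batch_dims, **input_dims}
    per_dim_slices = {
        dim: [slice(i, i + size) for i in range(0, ds.sizes[dim], size)]
        for dim, size in all_dims.items()
    }
    # Cartesian product of per-dimension slices -> one selector per batch.
    dims = list(per_dim_slices)
    for combo in itertools.product(*per_dim_slices.values()):
        yield dict(zip(dims, combo))


da = xr.DataArray(np.ones((4, 6)), dims=["X", "time"], name="da")
selectors = list(make_selectors(da, input_dims={}, batch_dims={"time": 2}))
# A batch is only materialized when indexed, keeping memory usage flat:
batch0 = da.isel(selectors[0])
print(len(selectors), batch0.sizes["time"])  # 3 batches, each 2 along time
```

Because only small slice objects are stored per batch, memory stays proportional to the number of batches rather than the data volume, which is what made the v0.1.0 workflow cheap.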

Two requirements for resolving this issue are:

  • Add a test for batch_dims (which should pass on v0.1.0 and fail on v0.2.0) (xref Testing improvements #83)
  • Modify the batches attribute to store indices associated with both batch_dims and input_dims
