issues with dask interpolate #185
A couple of updates after looking at this a bit. On the same dataset in the gist, it's still not clear to me why extensive and intensive aren't working, because we should be able to aggregate the results from area_tables_interpolate from the distributed dask dataframes the same way we do for categorical. The only difference between categorical and intensive/extensive in the single-core implementation is that categorical just does a row-sum across the table (masked by the different categorical values), whereas extensive/intensive first do a row-transform, then a dot product with the extensive/intensive variables. Since the intersection table is correct (otherwise categorical couldn't be correct), it's something about the math that's off, and my best guess is that it's the row-transform (maybe we're missing important neighbors in other partitions?). I dunno.

Either way, if we use dask to distribute building the values of the AoI table (i.e. replace _area_tables_binning_parallel, moving the parallelization down a level from where it is now), I think it would solve that issue anyway. Currently, we get a speedup with dask because we're breaking the geodataframes into spatial partitions, building several AoI tables, operating on them, and aggregating the results. But I think we need to (use dask to) break the pieces apart to build a single 'table' (sparse AoI matrix), then resume business as usual. Like with the new libpysal.Graph, I think maybe the way to handle this is to build the AoI table as an adjlist in the distributed dask df, then bring the chunks back and pivot to sparse, then send the […]
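To make that difference concrete, here's a toy sketch of the two code paths as I understand them (the matrix orientation and variable names are assumptions for illustration, not the actual tobler internals):

```python
import numpy as np
from scipy import sparse

# Toy intersection-area ("AoI") table: rows are source zones, columns are
# target zones (orientation assumed here for illustration only).
table = sparse.csr_matrix(
    np.array(
        [[2.0, 0.0, 1.0],
         [0.0, 3.0, 0.0],
         [1.0, 1.0, 1.0],
         [0.0, 0.0, 4.0]]
    )
)

# Categorical: mask the table by each category of the source variable and sum,
# giving the area of each category that falls in every target zone.
cats = np.array(["a", "b", "a", "b"])
cat_area = {
    c: np.asarray(table[cats == c].sum(axis=0)).ravel() for c in np.unique(cats)
}

# Extensive: row-transform the table (each source row sums to 1), then take a
# dot product with the source values, splitting each source total by area share.
vals = np.array([10.0, 20.0, 30.0, 40.0])
row_sums = np.asarray(table.sum(axis=1)).ravel()
weights = sparse.diags(1.0 / row_sums) @ table
extensive_est = weights.T @ vals

# If that row-transform happens per dask partition, rows whose area is split
# across partitions get normalized by only part of their total -- one way the
# extensive/intensive numbers could come out wrong.
```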
OK, I see dask needs the metadata ahead of time, so we need to create the new columns, but I'm still not sure we need to cast as categorical instead of looking at unique vals and allowing string type? |
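For what it's worth, a minimal sketch of declaring the metadata up front without a categorical cast (the column names and per-partition step are made up for illustration):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"id": [1, 2, 3, 4],
                    "landuse": ["res", "com", "res", "ind"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# dask only needs an empty frame describing the output schema, so in principle
# a plain object/string dtype should satisfy it without casting to categorical.
meta = pd.DataFrame({"id": pd.Series(dtype="int64"),
                     "landuse": pd.Series(dtype="object"),
                     "area_res": pd.Series(dtype="float64")})

def add_columns(df):
    # stand-in for whatever per-partition work creates the new columns
    out = df.copy()
    out["area_res"] = (out["landuse"] == "res").astype(float)
    return out

result = ddf.map_partitions(add_columns, meta=meta).compute()
```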
cc @martinfleis @ljwolf @sjsrey in case anyone else has thoughts |
Thanks very much for chipping in @knaaptime!!! A few thoughts, follow-ups:
|
Exactly. So we reimplement […]
I don't think we want to interpolate the chunks. The tests make clear that all the expensive computation is building the adjacency table, but multiplying through it is basically costless. Conversely, it is very expensive to try to send the multiplication through the distributed chunks. So in #187, let's just return the adjacency list, convert it to a (pandas, not dask) MultiIndex series with the sparse dtype, then dump that out to sparse. Then we just pick up from right here, having already handled the expensive part |
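Something like this, assuming the adjacency list comes back with source/target ids and the shared area (names here are placeholders, not the actual columns):

```python
import pandas as pd

# Adjacency table of (source id, target id, shared area) collected back from
# the distributed partitions (ids and column names are hypothetical)...
adj = pd.DataFrame({"source": [0, 0, 1, 2, 2],
                    "target": [0, 2, 1, 0, 1],
                    "area":   [2.0, 1.0, 3.0, 1.5, 0.5]})

# ...converted to a plain pandas Series with a MultiIndex and sparse dtype...
s = pd.Series(adj["area"].to_numpy(),
              index=pd.MultiIndex.from_frame(adj[["source", "target"]]),
              dtype="Sparse[float]")

# ...and dumped out to a scipy COO matrix, after which the single-core
# interpolation math can pick up as usual.
aoi, row_ids, col_ids = s.sparse.to_coo(row_levels=["source"],
                                        column_levels=["target"],
                                        sort_labels=True)
```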
@darribas can you take a look at what's going wrong with the dask interpolator? I want to cut a new release this month but don't want to do so until everything is passing. If you don't have time, I'll make the dask stuff private until it's ready |
I've finally had some time to take a closer look at the dask backend, so starting a thread here to triage some thoughts. In this gist I'm looking at timing for some pretty big datasets (source=2.7m, target=137k). tl;dr:

- […] distributed, everything crashes)

So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that I've interacted with the code a little more:

- […] AttributeError: Can only use .cat accessor with a 'category' dtype. If dask absolutely requires a categorical type, we should do a check or conversion (see the sketch below)
- […] unique […]
- […] area_interpolate_dask, and probably instead set the index to id_col
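If the categorical requirement stays, the check/conversion mentioned above could be as simple as something like this (a hedged sketch, not a concrete API proposal):

```python
import pandas as pd

def ensure_categorical(series: pd.Series) -> pd.Series:
    """Hypothetical guard: only cast to 'category' when needed, so a later
    .cat accessor doesn't blow up on plain string/object columns."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        return series
    return series.astype("category")

# e.g. a string column that would otherwise trip the .cat accessor
landuse = pd.Series(["res", "com", "res", "ind"])
landuse = ensure_categorical(landuse)
categories = landuse.cat.categories
```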