what's the best way to apply a channel slice? #109
To do this with maximal efficiency, you'd have to use a pre-processing step to figure out the optimal chunking strategy for the channel dimension:

```python
# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]
# Read the very small DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION").compute()
# Create a dataset per row from SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")
# Number of channels for each dataset
nchan = [spws[ddid[i].SPECTRAL_WINDOW_ID[0]].CHAN_FREQ.shape[0] for i in ddids]
# Channel chunking schema for each dataset
# (note that Python slices expose .stop, not .end)
chan_chunks = [(chanslice.start - 0, chanslice.stop - chanslice.start, nc - chanslice.stop)
               for nc in nchan]
# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]
# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)
# This should slice the channel selection optimally, without dask block overlap
datasets[0].DATA[:, chanslice, :]
```

I typed this out without running it, but it should illustrate the idea. The above is clunky; I'm thinking about general improvements for the process in #86
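The chunk computation above can be factored into a small helper. This is a hypothetical sketch, not part of the library: `channel_chunks` and its arguments are names invented here. It turns a channel slice and a channel count into a chunk tuple aligned with the slice:

```python
def channel_chunks(chanslice, nchan):
    """Split nchan channels into chunks aligned with chanslice.

    Returns a tuple suitable for a dask 'chan' chunking schema,
    dropping any zero-sized leading/trailing chunks.
    """
    start = chanslice.start or 0
    stop = nchan if chanslice.stop is None else chanslice.stop
    parts = (start, stop - start, nchan - stop)
    return tuple(p for p in parts if p > 0)

# Example: 19 selected channels out of 64
print(channel_chunks(slice(37, 56), 64))  # -> (37, 19, 8)
```

With a helper like this, the per-dataset schema reduces to `[{'row': 100000, 'chan': channel_chunks(chanslice, nc)} for nc in nchan]`.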
@sjperkins , I'm unable to follow properly. I usually just do
It's not wrong, but that way will end up reading more data from the MS than necessary. What the example is trying to demonstrate is that, to prevent this behaviour, the dask chunking must be correctly set up in `xds_from_ms`. I agree that it's not necessarily easy to follow; I simply haven't found a reasonable way of making this easy yet. #86 should probably form the nucleus for dealing with this.
Let's look at this with a concrete example:

```python
nchan = 64
chanslice = slice(37, 56)
chan_chunks = (37, 19, 8)  # i.e. (37 - 0, 56 - 37, 64 - 56)
datasets = xds_from_ms(..., chunks={..., 'chan': chan_chunks})
ds.DATA.data[:, chanslice, :]
```

That will create DATA dask arrays with channel chunks of (37, 19, 8). Each channel chunk will be read from the MS exactly, using getcolslice. If, by contrast, we don't ask for channel chunking:

```python
datasets = xds_from_ms(...)
ds.DATA.data[:, chanslice, :]
```

there'll be a single channel chunk for the dask DATA arrays, containing the full 64 channels. So there'll be a getcol operation behind the scenes that reads all the channel data, followed by a slice that returns the channel subset.
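The I/O difference can be illustrated without an MS at all. The sketch below uses hypothetical helper names (not dask or dask-ms API) and counts how many channels get read from disk, under the assumption that every chunk overlapping the slice is loaded in full:

```python
def channels_read(chunks, chanslice):
    """Total channels loaded from disk for a slice, assuming each
    chunk that overlaps the slice is read in its entirety."""
    total, offset = 0, 0
    for size in chunks:
        lo, hi = offset, offset + size
        # A chunk is read if it overlaps [chanslice.start, chanslice.stop)
        if lo < chanslice.stop and hi > chanslice.start:
            total += size
        offset = hi
    return total

sel = slice(37, 56)
print(channels_read((37, 19, 8), sel))  # aligned chunking -> 19
print(channels_read((64,), sel))        # single chunk -> 64
```

With slice-aligned chunks only the middle 19-channel chunk overlaps the selection, so the 37 + 8 unselected channels are never read; with a single 64-channel chunk, everything is.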
I also want to say it's good that people are pointing this out as a conceptually difficult thing, because I do want to make things easier and reduce the cognitive overhead.
As an aging professor, I can only applaud the sentiment. My cognitive is all overhead!
This makes it clearer, thanks :)!
Been asking elsewhere, but it occurs to me it's a general enough question to be usefully asked here....
So, what's the best way to read a subset of channels? I can apply a slice to the array objects of course, but is there a way to do this up front? Or a better way?