DAS-2232 - small functions added to support the main solution in the t… #16
Conversation
This pull request should target the feature branch, DAS-2208-SMAP-L3, not main (yet).
Right. Not sure how to switch it to the feature branch.
@sudha-murthy you can switch the target branch by selecting to edit the pull request. It should be an option near the top of the page.
hoss/coordinate_utilities.py
Outdated
override_variable = varinfo.get_variable(variable_name)

if override_variable is not None and (
    override_variable.is_latitude() or override_variable.is_longitude()
This might be ok in context, but as a stand-alone function it is a bit odd - taking in a variable_name, and then only working if that variable is either a latitude or longitude variable. In fact, the result is not really dependent upon the variable_name passed in, other than to use its path. It might be that it has the side effect of verifying the variable_name as a latitude or longitude, but that is because it calls the is_latitude and is_longitude VarInfo methods, which can be called directly. It raises the MissingCoordinateVariable exception if the variable passed in is not a latitude or longitude but is a known variable, i.e., not intrinsically a missing-coordinate use case.
It has the feel of a method perhaps extracted as frequently used code, in which case the name could be check_coord_var_and_get_std_dimensions. Alternatively, if there is no need to verify the variable is a coordinate, the code could be simplified to simply return standard dimensions within a shared path - a get_std_projected_dimension_reference. If the only use is in the following method, that would be my recommendation.
As is, the name could also be get_std_projected_coordinate_dimensions, but I wonder about the need to check for is_latitude or is_longitude.
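For illustration, the simplified helper described above might look something like this (a minimal sketch; the name and signature are hypothetical, not code from this PR):

def get_std_projected_dimension_references(group_path: str) -> list[str]:
    # Return the standard projected dimension references within a shared
    # group path, with no latitude/longitude verification.
    return [f'{group_path}/projected_y', f'{group_path}/projected_x']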
Will use your newly suggested names.
I originally had is_latitude map to projected_y and is_longitude map to projected_x.
I removed that.
But we do need to validate that they are coordinates - we are using that path for the projected dimensions.
Maybe it does not have to be within this method.
Made the updates - removed "override" and renamed the methods. Removed the lat/lon check in the get_projected_dimension_names function.
commit - ca881d1
hoss/coordinate_utilities.py
Outdated
return projected_dimension_names

def get_override_projected_dimensions( |
I'm not sure if this is expected to be updated when we do use a configuration file, but in the meantime, perhaps get_std_projected_dimensions is a better name. Also, such a revised name suggests a method that is well defined and ready to merge, versus getting the projected dimensions, possibly with transposed names, which is TBD.
ok
Used override because we are overriding the coordinates with projected dimensions.
"when we do use a configuration file"
I'm going to sound like a broken record, but I do not think we should assume leaning on a configuration file is the best implementation. There are two relevant tenets I try to generally adhere to in our service code:
- Avoid hard-coding collection specific information into the code itself.
- Minimise the need for configuration file entries, to maintain a low barrier to entry for data providers to on-board collections to a service.
We have the information we need to determine dimension ordering while not needing configuration file-based information. And, if we write the code in a general way, this could open HOSS up to a large number of collections that are HDF-5 format but without proper 1-D dimension variables. That could be a massive win! But it would be stymied by data providers needing to add information to the configuration file. To date, no-one outside of DAS has added any configuration file information to any of our services in the last 3 years. That's pretty telling.
And, yup, I acknowledge that this monologue is not strictly relevant to this PR.
hoss/coordinate_utilities.py
Outdated
return set(
    variable
    for variable in variables
    if len(varinfo.get_variable(variable).dimensions) == 0
Here is where we need the additional:
if (len(varinfo.get_variable(variable).dimensions) == 0
or all([ for dimension in varinfo.get_variable(variable.dimensions) :
varinfo.get_variable(dimension) not None and not [] )
(excuse the pidgin Python)
Maybe I will include it as a comment till we make the configuration change?
I strongly prefer what we have now. Also, the function contents exactly match the function name.
Also, the snippet supplied above is not correct Python code, so it's hard to know for sure what you are trying to achieve. Trying to decompose that snippet:
if (len(varinfo.get_variable(variable).dimensions) == 0
or all([ for dimension in varinfo.get_variable(variable.dimensions) : varinfo.get_variable(dimension) not None and not [] )
- The first bit still makes sense - if the variable in question doesn't have dimensions.
- Then, I think you are trying to check that the VarInfoFromDmr instance does not have any of the listed dimensions as variables.
- The and not [] is a no-op. It will always evaluate to True, because it is being evaluated in isolation, and you are asking if an empty list is "falsy", which it is.
While I don't like this approach, I think what you are trying to suggest would be more like:
if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or all(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )
):
If this was to be augmented in such a way, I would recommend breaking this check out into its own function, because the set comprehension will become very hard to read.
I can see splitting out the function to clarify the code and document the comprehension.
I'm less clear on forcing the upstream code to use this code without the additional check, and then adding in the call to the new function in every usage. That additional check is now essential to ensure the case of OPeNDAP creating "empty" dimensions does not allow this check, by itself, to succeed. And of course, OPeNDAP's "empty" dimensions are pretty much always going to be the case.
Original:

if len(varinfo.get_variable(variable).dimensions) == 0

Suggested:

if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or any_absent_dimension_variables(variable)
):
    ...

def any_absent_dimension_variables(variable: VarInfo.variable) -> bool:
    return any(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )
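One thing worth noting about this suggestion: it switches from all to any. The earlier sketch required every listed dimension variable to be absent, whereas any_absent_dimension_variables flags a variable if even one of its dimension variables is missing, which lines up with the OPeNDAP "empty" dimensions case described above.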
function updated and unit tests added - d972777
First up - nice work. There are a couple of things I really like in this PR:
- The scope of each function in coordinate_utilities.py tends to be of a single level of abstraction, making the functions fairly clean.
- There are a bunch of unit tests. I think there are still a number of subtests that should be added, but based on what you've already put in place, most of them should be really quick to add.
I've added a lot of comments. I think some of them are really small things, like wording or additional doc-string comments. Some of them are extra test cases in the unit tests (and I am more than happy to help if you have unit test questions!), and some of them are acknowledging that things will be done in the future (but maybe a "TODO" comment would be helpful).
I think the biggest things in the code itself were:
- Potentially refactoring get_lat_lon_arrays to just get one array at a time, reducing two blocks of near-identical code.
- Reworking the calculation for the dimension scale. Given the information you have, I think you don't need separate logical blocks for calculating an ascending versus a descending array (see the sketch after this comment). (The good news is that with your unit tests, you could drop in my suggestion and just see if it works with the tests you already have.)
- The fact that if you are trying to find valid indices for an array and specify the fill values, the latitude/longitude checks won't be applied.
- Accounting for longitude arrays that are 0 to 360 degrees east, not just -180 to 180.
Even these aren't massive things.
P.S. Did you want to close the old PR? It won't be deleted, so we'll still have access to comments.
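On the dimension scale point above, a minimal sketch of a single calculation that covers both ascending and descending arrays (hypothetical function name; assumes evenly spaced coordinate values):

import numpy as np

def compute_dimension_scale(first_value: float, last_value: float, size: int) -> np.ndarray:
    # np.linspace produces evenly spaced values in either direction, so a
    # descending scale falls out naturally when last_value < first_value,
    # with no separate ascending/descending branches.
    return np.linspace(first_value, last_value, size)

compute_dimension_scale(0.0, 3.0, 4)  # -> array([0., 1., 2., 3.])
compute_dimension_scale(3.0, 0.0, 4)  # -> array([3., 2., 1., 0.])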
hoss/coordinate_utilities.py
Outdated
f'{override_variable.group_path}/projected_y',
f'{override_variable.group_path}/projected_x',
This choice of dimension names is better than before, but it's still likely to cause problems down the road. This function currently relies on the assumption that coordinate variables for different grids will be in different groups, which is not guaranteed.
(I think we see this with some of the gridded ICESat-2 ATLAS collections, for example)
I guess I could just have group_path/coordinate_name_projected_x or _projected_y - the coordinate names would be unique.
It would be latitude_projected_x or latitude_projected_y. It would not matter; it would be unique.
One approach (that would make this function more complicated 😞) could be to:
- (Already done) Get the VariableFromDmr instance.
- Get the references from the coordinates metadata attribute.
- Find the latitude and longitude variables from those references and put them in a list - maybe reusing your get_coordinate_variables function.
- Make a unique hash of those references.
- Prepend that hash on to .../projected_x and .../projected_y (like you're already doing with the group path).
I took some inspiration from MaskFill. You could try something like:
from hashlib import sha256

# Faking the coordinates list for now:
coordinates_list = ['/latitude', '/longitude']
coordinates_hash = sha256(' '.join(coordinates_list).encode('utf-8')).hexdigest()

projected_dimension_names = [
    f'{coordinates_hash}/projected_y',
    f'{coordinates_hash}/projected_x',
]
The string will look a bit ugly (e.g., 'a089b9ebff6935f6c8332710de2ee3b351bd47c1fb807b22765969617027e8d2'), but it will be unique and reproducible.
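(If the full digest feels unwieldy, a truncated prefix - e.g., a hypothetical coordinates_hash[:8] - would likely still be unique in practice for the small number of grids in a granule, at the cost of a theoretical collision risk.)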
@owenlittlejohns @autydp - I think I have updated based on all the comments, except the ones related to 3D data and the order of dimensions, which will be in the next PR.
Thanks for addressing all the smaller comments and adding a few more unit tests in. I think this is pretty much there.
I added a couple of other comments, but there was only one thing that I'd definitely like to see changed (the if coordinate_fill to if coordinate_fill is not None tweak).
I did also try to think some more about an appropriate prefix for the projected 1-D dimension variable names. I've popped an idea down there which might be helpful (I hope!). I'm not sure if that needs to be addressed in this PR, but I think it will need to be taken care of before the completion of the overall feature.
hoss/coordinate_utilities.py
Outdated
Returns indices of a valid array without fill values
"""

if coordinate_fill:
Thanks for reworking this so that the fill value check and the latitude/longitude range checks can both happen.
I think the coordinate.is_latitude() and coordinate.is_longitude() checks could benefit from some numpy magic, rather than looping individually through each element. I think what you could use is the element-wise and, which can be written as either & or np.logical_and. You could do something like:
if coordinate_fill is not None:
    is_not_fill = ~np.isclose(coordinate_row_col, float(coordinate_fill))
else:
    # Creates an entire array of `True` values.
    is_not_fill = np.ones_like(coordinate_row_col, dtype=bool)

if coordinate.is_longitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -180.0,
                coordinate_row_col <= 360.0,
            ),
        )
    )
elif coordinate.is_latitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -90.0,
                coordinate_row_col <= 90.0,
            ),
        )
    )
else:
    valid_indices = np.empty((0, 0))
Note, in the snippet above, I've also changed the first check from if coordinate_fill to if coordinate_fill is not None. That's pretty important, as zero could be a valid fill value, but if coordinate_fill = 0, then this check will evaluate to False.
Ultimately, I think the conditions you have now are equivalent to this, just maybe not as efficient. So the only thing I'd definitely like to see changed is that first if coordinate_fill condition, so that it's not considering a fill value of 0 to be non-fill.
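A quick standalone illustration of that falsy-zero pitfall (not code from this PR):

coordinate_fill = 0
if coordinate_fill:
    print('treating 0 as a fill value')  # Never reached: 0 is falsy.
if coordinate_fill is not None:
    print('treating 0 as a fill value')  # Reached, as intended.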
hoss/coordinate_utilities.py
Outdated
return valid_indices

def get_fill_value_for_coordinate( |
Python plays a little fast and loose with types sometimes. If you have an integer and you are trying to do something with it and a float, then Python tries to be clever and treat that integer as a float.
For example:
In [1]: integer_variable = 2
In [2]: type(integer_variable)
Out[2]: int
In [3]: integer_variable == 2.0
Out[3]: True
So, I think we're probably okay in the instance that the fill value metadata attribute is an integer. (But I totally acknowledge that there is somewhat sneaky stuff Python is doing in the background here)
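The same promotion happens element-wise in numpy, which is why an integer fill value still compares cleanly against a float coordinate array (an illustrative snippet, not code from this PR):

import numpy as np

# The integer 2 is promoted to a float for the element-wise comparison:
np.isclose(np.array([2.0, 3.0]), 2)  # -> array([ True, False])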
hoss/coordinate_utilities.py
Outdated
)[0]
elif coordinate_name == 'longitude':
    valid_indices = np.where(
        (coordinate_row_col >= -180.0) & (coordinate_row_col <= 180.0)
I just wanted to come back to this thread - I was wrong about the &. But I think I prefer the equivalent np.logical_and, just because I'm generally too verbose.
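For anyone following along, the two spellings are element-wise equivalent on boolean arrays (a standalone check, not code from this PR):

import numpy as np

a = np.array([True, True, False])
b = np.array([True, False, False])
assert np.array_equal(a & b, np.logical_and(a, b))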
expected_row_col_sizes = (5, 10)
lat_1d_array = np.array([1, 2, 3, 4])
lon_1d_array = np.array([6, 7, 8, 9])
lat_mismatched_array = np.array([[1, 2, 3], [3, 4, 5]])
lon_mismatched_array = np.array([[6, 7], [8, 9], [10, 11]])
lat_empty_ndim_array = np.array(0)
lon_empty_ndim_array = np.array(0)
lat_empty_size_array = np.array([])
lon_empty_size_array = np.array([])
I think I mentioned this somewhere else, but it's a lot easier for the reader if these things that relate to specific subtests were within the relevant with self.subTest(...) context manager. From a quick skim, it doesn't look like these items are used in multiple subtests, so moving them shouldn't break anything. (Note - subtests aren't always executed in the order that they are written - I think unittest orders things by alphabetising the test classes, methods and subtest descriptions in some way.)
It's not a big deal, and not something to hold a PR up over, but it's a nice-to-have, which means I can look at a single subtest and see what's relevant to it in a single look.
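A minimal sketch of the suggested pattern (hypothetical test and data names):

import unittest
import numpy as np

class TestCoordinateUtilities(unittest.TestCase):
    def test_get_row_col_sizes(self):
        with self.subTest('1-D latitude and longitude arrays'):
            # Data defined inside the subtest that uses it:
            lat_1d_array = np.array([1, 2, 3, 4])
            lon_1d_array = np.array([6, 7, 8, 9])
            self.assertEqual(lat_1d_array.size, lon_1d_array.size)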
@owenlittlejohns on the earlier comment on the get_fill_value function -
If I don't do the float typecast when I get the fill_value, I get the following error, because the get_attribute_value method in varinfo returns a str value:
TypeError: The DType <class 'numpy.FloatAbstractDType'> could not be promoted by <class 'numpy.dtype[str]'>.
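In other words, something like this is needed (the attribute value shown is hypothetical):

# varinfo returns the _FillValue attribute as a string, so cast to float
# before comparing against the numpy coordinate arrays:
fill_value_attribute = '-9999.0'
coordinate_fill = float(fill_value_attribute)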
@owenlittlejohns - updated the get_valid_indices function with your approach. Had to add [0], since np.where returns a tuple.
Also corrected the subtests and moved the test data near where it is used.
Looks good now. Thanks for the updates.
Looks good now. Sorry for the delay. Please note there is some confusion with my GitHub user account - I fumbled setting things up properly and now have 2 (maybe 3?) accounts. Please request reviews and make references to the d-auty account.
Thanks, David
DAS-2232
Description
This PR updates hoss_config.json and pip_requirements.txt, and adds a set of small functions in a new file called coordinate_utilities.py. The aim of the ticket is to use the coordinate variables to create dimension scales for collections that lack support for CF conventions such as dimension scales and a grid mapping variable. This PR just has a few methods and supporting unit tests towards the final solution. It will be merged to the feature branch DAS-2208-SMAP-L3 until all the required changes for the ticket are complete.
DAS-2232
Local Test Steps
The unit tests should all pass.
In a local Harmony install:
- bin/build-tests
- bin/run-tests
PR Acceptance Checklist
- CHANGELOG.md updated to include high level summary of PR changes.
- docker/service_version.txt updated if publishing a release.