
DAS-2232 - small functions added to support the main solution in the t… #16

Merged: 9 commits into DAS-2208-SMAP-L3 from DAS-2232-1 on Nov 1, 2024

Conversation

sudha-murthy
Collaborator

DAS-2232

Description

This PR updates hoss_config.json, pip_requirements.txt, and adds a set of small functions in a new file called coordinate_utilities.py. The aim of the ticket is to use the coordinate variables to create dimension scales for collections that lack support for CF conventions such as dimension scales and grid mapping variables. This PR contains just a few methods and supporting unit tests towards the final solution. It will be merged to the feature branch DAS-2208-SMAP-L3 until all the required changes for the ticket are complete.

DAS-2232

Local Test Steps

The unit tests should all pass.
In a local Harmony install:

  • fetch the DAS-2232-1 branch from the harmony-opendap-subsetter repo
  • cd harmony-opendap-subsetter
  • bin/build-image
  • bin/build-tests
  • bin/run-tests

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • [x] Tests added/updated and passing.
  • Documentation updated (if needed).

@autydp
Contributor

autydp commented Oct 20, 2024

This pull request should be merged into the feature branch DAS-2208-SMAP-L3, not into main (yet).

@sudha-murthy
Collaborator Author

This pull request should be merged into the feature branch DAS-2208-SMAP-L3, not into main (yet).

Right. Not sure how to switch it to the feature branch.

@owenlittlejohns
Member

@sudha-murthy you can switch the target branch by selecting to edit the pull request. It should be an option near the top of the page.

@sudha-murthy sudha-murthy changed the base branch from main to DAS-2208-SMAP-L3 October 21, 2024 03:13
override_variable = varinfo.get_variable(variable_name)

if override_variable is not None and (
    override_variable.is_latitude() or override_variable.is_longitude()
Contributor

This might be ok in context, but as a stand-alone function it is a bit odd - taking in a variable_name, and then only working if that variable is either a latitude or longitude variable. In fact, the result is not really dependent upon the variable_name passed in, other than to use the path. It might be that it has the side effect of verifying the variable_name as a latitude or longitude, but that is because it calls upon the is_latitude and is_longitude VarInfo methods, which can be called directly. It raises the exception MissingCoordinateVariable if the variable passed in is not a latitude or longitude, but is a known variable, i.e., not intrinsically a missing coordinate use case.

It has the feel of a method perhaps taken as frequently used code, in which case the name could be check_coord_var_and_get_std_dimensions. Alternatively, if there is not the need to verify the variable is a coordinate, the code could be simplified to simply return standard dimensions within a shared path - a get_std_projected_dimension_reference. If the only use is in the following method, that would be my recommendation.

As is, the name could also be get_std_projected_coordinate_dimensions, but I wonder about the need to check for is_latitude or is_longitude.
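
For illustration, a minimal sketch of the simplified alternative described above (the function name follows the suggestion; the body and the example group path are assumptions):

def get_std_projected_dimension_reference(group_path: str) -> list[str]:
    """Return the standard projected dimension references for a group path."""
    return [f'{group_path}/projected_y', f'{group_path}/projected_x']

# e.g. get_std_projected_dimension_reference('/some_group')
# -> ['/some_group/projected_y', '/some_group/projected_x']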

Collaborator Author

Will use your new suggested names.

I originally had is_latitude map to projected_y and is_longitude map to projected_x.
I removed that.
But we do need to validate that they are coordinates - we are using that path for the projected dimensions.
Maybe it does not have to be within this method.

Collaborator Author

Made the updates - removed "override" and renamed the methods. Removed the lat/lon check in the get_projected_dimension_names function.
commit - ca881d1

return projected_dimension_names


def get_override_projected_dimensions(
Contributor

I'm not sure if this is expected to be updated when we do use a configuration file, but in the mean time, perhaps get_std_projected_dimensions is a better name. Also, such a revised name suggests a method that is well defined and ready to merge, vs. getting the projected dimensions, possibly with transposed names, which is TBD.

Collaborator Author

ok

Collaborator Author

Used override because we are overriding the coordinates with projected dimensions.

Member

when we do use a configuration file

I'm going to sound like a broken record, but I do not think we should assume leaning on a configuration file is the best implementation. There are two relevant tenets I try to generally adhere to in our service code:

  1. Avoid hard-coding collection specific information into the code itself.
  2. Minimise the need for configuration file entries, to maintain a low barrier to entry for data providers to on-board collections to a service.

We have the information we need to determine dimension ordering while not needing configuration file-based information. And, if we write the code in a general way, this could open HOSS up to a large number of collections that are HDF-5 format but without proper 1-D dimension variables. That could be a massive win! But it would be stymied by data providers needing to add information to the configuration file. To date, no-one outside of DAS has added any configuration file information to any of our services in the last 3 years. That's pretty telling.

Member

And, yup, I acknowledge that this monologue is not strictly relevant to this PR.

return set(
    variable
    for variable in variables
    if len(varinfo.get_variable(variable).dimensions) == 0
Contributor

Here is where we need the additional:

if (len(varinfo.get_variable(variable).dimensions) == 0
    or all([ for dimension in varinfo.get_variable(variable.dimensions) : 
                  varinfo.get_variable(dimension) not None and not [] )

(excuse the pidgin Python)

Collaborator Author

Maybe I will include it as a comment till we make the configuration change?

Member

I strongly prefer what we have now. Also, the function contents exactly match the function name.

Also, the snippet supplied above is not correct Python code, so it's hard to know for sure what you are trying to achieve. Trying to decompose that snippet:

if (len(varinfo.get_variable(variable).dimensions) == 0
    or all([ for dimension in varinfo.get_variable(variable.dimensions) : varinfo.get_variable(dimension) not None and not [] )
  • The first bit still makes sense - if the variable in question doesn't have dimensions.
  • Then, I think you are trying to see the VarInfoFromDmr instance does not have any of the listed dimensions as variables.
  • The and not [] is a no-op. It will always evaluate to True, because it is being evaluated in isolation, and you are asking if an empty list is "falsy", which it is.

While I don't like this approach, I think what you are trying to suggest would be more like:

if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or all(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )
)

If this was to be augmented in such a way, I would recommend breaking this check out into its own function, because the set comprehension will become very hard to read.

@D-Auty D-Auty Oct 28, 2024

I can see splitting out the function to clarify the code and document the comprehension.
I'm less clear on forcing the upstream code to use this code without the additional check, and then adding in the call to the new function in every usage. That additional check is now essential to ensure the case of OPeNDAP creating "empty" dimensions does not allow this check, by itself, to succeed. And of course, OPeNDAP's "empty" dimensions is pretty much always going to be the case.


Suggested change:

# Before:
if len(varinfo.get_variable(variable).dimensions) == 0

# After:
if (
    len(varinfo.get_variable(variable).dimensions) == 0
    or any_absent_dimension_variables(variable)
)
...

def any_absent_dimension_variables(variable) -> bool:
    return any(
        varinfo.get_variable(dimension) is None
        for dimension in varinfo.get_variable(variable).dimensions
    )

Collaborator Author

function updated and unit tests added - d972777

@owenlittlejohns owenlittlejohns left a comment (Member)

First up - nice work. There are a couple of things I really like in this PR:

  • The scope of each function in coordinate_utilities.py tends to be of a single level of abstraction, making the functions fairly clean.
  • There are a bunch of unit tests. I think there are still a number of subtests that should be added, but based on what you've already put in place, most of them should be really quick to add.

I've added a lot of comments. I think some of them are really small things, like wording or additional doc-string comments. Some of them are extra test cases in the unit tests (and I am more than happy to help if you have unit test questions!), and some of them are acknowledging that things will be done in the future (but maybe a "TODO" comment would be helpful).

I think the biggest things in the code itself were:

  • Potentially refactoring get_lat_lon_arrays to just get one array at a time, and reduce two blocks of near identical code.
  • Reworking the calculation for the dimension scale. Given the information you have, I think you don't need separate logical blocks for calculating an ascending versus a descending array. (The good news is that with your unit tests, you could drop in my suggestion and just see if it works with the test you already have)
  • The fact that if you are trying to find valid indices for an array and you specify the fill values, the latitude/longitude checks won't be applied.
  • Accounting for longitude arrays that are 0 to 360 degrees east, not just -180 to 180 (see the sketch after this list).
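
For reference, one common way to normalise longitudes from the [0, 360) convention to [-180, 180) - a sketch only, not necessarily how HOSS will handle it:

import numpy as np

longitudes = np.array([0.0, 90.0, 180.0, 270.0, 359.0])
normalised = ((longitudes + 180.0) % 360.0) - 180.0
print(normalised)  # [   0.   90. -180.  -90.   -1.]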

Even these aren't massive things.

P.S. Did you want to close the old PR? It won't be deleted, so we'll still have access to comments.

Comment on lines 35 to 36
f'{override_variable.group_path}/projected_y',
f'{override_variable.group_path}/projected_x',
Member

This choice of dimension names is better than before, but it's still likely to cause problems down the road. This function currently relies on the assumption that coordinate variables for different grids will be in different groups, which is not guaranteed.

(I think we see this with some of the gridded ICESat-2 ATLAS collections, for example)

Collaborator Author

I guess I could just have group_path/coordinate_name_projected_x or projected_y.
The coordinate names would be unique.
It would be latitude_projected_x or latitude_projected_y. It would not matter; it would be unique.

Member

One approach (that would make this function more complicated 😞) could be to:

  • (Already done) - Get the VariableFromDmr instance.
  • Get the references from the coordinates metadata attribute.
  • Find the latitude and longitude variables from those references and put them in a list. - Maybe reusing your get_coordinate_variables function.
  • Make a unique hash of those references.
  • Prepend that hash on to .../projected_x and .../projected_y (like you're already doing with the group path).

I took some inspiration from MaskFill. Try something like:

from hashlib import sha256

# Faking the coordinates list for now:
coordinates_list = ['/latitude', '/longitude']

coordinates_hash = sha256(' '.join(coordinates_list).encode('utf-8')).hexdigest()

projected_dimension_names = [
    f'{coordinates_hash}/projected_y',
    f'{coordinates_hash}/projected_x',
]

The string will look a bit ugly (e.g., 'a089b9ebff6935f6c8332710de2ee3b351bd47c1fb807b22765969617027e8d2'), but it will be unique and reproducible.


@sudha-murthy
Collaborator Author

@owenlittlejohns @autydp - I think I have updated based on all the comments, except the ones related to 3D and the order of dimensions, which will be in the next PR.
I have not done assertRaisesRegex for the exception message. I can do that if it is needed.

@owenlittlejohns owenlittlejohns left a comment (Member)

Thanks for addressing all the smaller comments and adding a few more unit tests in. I think this is pretty much there.

I added a couple of other comments, but there was only one thing that I'd definitely like to see changed (the if coordinate_fill to if coordinate_fill is not None tweak).

I did also try to think some more about an appropriate prefix for the projected 1-D dimension variable names. I've popped an idea down there which might be helpful (I hope!). I'm not sure if that needs to be addressed in this PR, but I think it will need to be taken care of before the completion of the overall feature.

Returns indices of a valid array without fill values
"""

if coordinate_fill:
Member

Thanks for reworking this so that the fill value check and the latitude/longitude range checks can both happen.

I think the coordinate.is_latitude() and coordinate.is_longitude() checks could benefit from some numpy magic, rather than looping individually through each element. I think what you could use is the element-wise-and, which can be either written as & or np.logical_and. You could do something like:

if coordinate_fill is not None:
    is_not_fill = ~np.isclose(coordinate_row_col, float(coordinate_fill))
else:
    # Creates an entire array of `True` values.
    is_not_fill = np.ones_like(coordinate_row_col, dtype=bool)

if coordinate.is_longitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -180.0,
                coordinate_row_col <= 360.0
            )
        )
    )
elif coordinate.is_latitude():
    valid_indices = np.where(
        np.logical_and(
            is_not_fill,
            np.logical_and(
                coordinate_row_col >= -90.0,
                coordinate_row_col <= 90.0
            )
        )
    )
else:
    valid_indices = np.empty((0, 0))

Note, in the snippet above, I've also changed the first check from if coordinate_fill to if coordinate_fill is not None. That's pretty important, as zero could be a valid fill value, but if coordinate_fill = 0, then this check will evaluate to False.

Ultimately, I think the conditions you have now are equivalent to this, just maybe not as efficient. So the only thing I'd definitely like to see changed is that first if coordinate_fill condition, so that it's not considering a fill value of 0 to be non-fill.
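
A quick illustration of that difference (the value here is made up):

coordinate_fill = 0  # zero can be a legitimate fill value

if coordinate_fill:
    print('fill handled')  # never runs: bool(0) is False

if coordinate_fill is not None:
    print('fill handled')  # runs, as intended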

return valid_indices


def get_fill_value_for_coordinate(
@owenlittlejohns owenlittlejohns (Member) Oct 28, 2024

Python plays a little fast and loose with types sometimes. If you have an integer and you are trying to do something with it and a float, then Python tries to be clever and treat that integer as a float.

For example:

In [1]: integer_variable = 2
In [2]: type(integer_variable)
Out[2]: int
In [3]: integer_variable == 2.0
Out[3]: True

So, I think we're probably okay in the instance that the fill value metadata attribute is an integer. (But I totally acknowledge that there is somewhat sneaky stuff Python is doing in the background here)

    )[0]
elif coordinate_name == 'longitude':
    valid_indices = np.where(
        (coordinate_row_col >= -180.0) & (coordinate_row_col <= 180.0)
Member

I just wanted to come back to this thread - I was wrong about the &. But I think I prefer the equivalent np.logical_and just because I'm generally too verbose.
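
For reference, a quick check that the two forms are equivalent for boolean arrays:

import numpy as np

in_range = np.array([True, True, False])
is_not_fill = np.array([True, False, True])

assert np.array_equal(in_range & is_not_fill, np.logical_and(in_range, is_not_fill))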


Comment on lines 362 to 374
expected_row_col_sizes = (5, 10)
lat_1d_array = np.array([1, 2, 3, 4])
lon_1d_array = np.array([6, 7, 8, 9])
lat_mismatched_array = np.array([[1, 2, 3], [3, 4, 5]])
lon_mismatched_array = np.array([[6, 7], [8, 9], [10, 11]])
lat_empty_ndim_array = np.array(0)
lon_empty_ndim_array = np.array(0)
lat_empty_size_array = np.array([])
lon_empty_size_array = np.array([])
Member

I think I mentioned this somewhere else, but it's a lot easier for the reader if these things that relate to specific subtests were within the relevant with self.subTest(...) context manager. From a quick skim, it doesn't look like these items are used in multiple subtests, so moving them shouldn't break anything. (Note - subtests aren't always executed in the order that they are written - I think unittest orders things by alphabetising the test classes, methods and subtest descriptions in some way)

It's not a big deal, and not something to hold a PR up over, but it's a nice to have, which means I can look at a single subtest and see what's relevant to it in a single look.
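
For illustration, a minimal sketch of the pattern being described (the test class and assertions here are hypothetical, not taken from the actual test file):

import unittest

import numpy as np


class TestCoordinateShapes(unittest.TestCase):
    def test_row_col_sizes(self):
        # Each subtest defines only the data it uses:
        with self.subTest('Mismatched 2-D coordinate arrays'):
            lat_mismatched_array = np.array([[1, 2, 3], [3, 4, 5]])
            lon_mismatched_array = np.array([[6, 7], [8, 9], [10, 11]])
            self.assertNotEqual(
                lat_mismatched_array.shape, lon_mismatched_array.shape
            )

        with self.subTest('Empty coordinate arrays'):
            lat_empty_size_array = np.array([])
            self.assertEqual(lat_empty_size_array.size, 0)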

@sudha-murthy sudha-murthy (Collaborator Author) Oct 29, 2024

@owenlittlejohns on the earlier comment on the get_fill_value function -
If I don't do the float typecast when I get the fill_value, I get the following error, because the get_attribute_value method in varinfo returns a str value:

TypeError: The DType <class 'numpy.FloatAbstractDType'> could not be promoted by <class 'numpy.dtype[str]'>.
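
A small illustration of the cast (the array and fill value here are made up):

import numpy as np

coordinate_row_col = np.array([25.5, -9999.0, 40.1])
fill_value = '-9999.0'  # metadata attribute values arrive as strings

# np.isclose(coordinate_row_col, fill_value) raises the DType promotion
# error above; casting to float first avoids it:
is_fill = np.isclose(coordinate_row_col, float(fill_value))
print(is_fill)  # [False  True False]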

@sudha-murthy sudha-murthy (Collaborator Author) Oct 29, 2024

@owenlittlejohns - updated the get_valid_indices function with your approach. I had to add [0] since np.where returns a tuple.
Also corrected the subtests and moved the test data near where it is used.

73390fd
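
For reference, a small example of why the [0] is needed (the array values are made up):

import numpy as np

coordinate_row_col = np.array([-90.0, 10.0, -9999.0, 45.0])

result = np.where(coordinate_row_col >= -90.0)
print(result)     # (array([0, 1, 3]),) - a tuple of index arrays
print(result[0])  # [0 1 3] - the 1-D index array itself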

@D-Auty D-Auty left a comment

Looks good now. Thanks for the updates.

@sudha-murthy sudha-murthy merged commit 3fc06f0 into DAS-2208-SMAP-L3 Nov 1, 2024
3 checks passed
@sudha-murthy sudha-murthy deleted the DAS-2232-1 branch November 1, 2024 13:45
@autydp autydp left a comment (Contributor)

Looks good now. Sorry for the delay. Please note there is some confusion with my GitHub user account - I fumbled setting things up properly and now have 2 (maybe 3?) accounts. Please request reviews and make references to the d-auty account.

@sudha-murthy
Collaborator Author

Thanks David
