DAS-2232 - Functionality added to support SMAP L3 products #15

Closed
wants to merge 41 commits into from

Conversation


@sudha-murthy (Collaborator) commented Sep 10, 2024

Description

SMAP L3 products do not have the 1-D dimension scales or grid mapping variables that HOSS needs to perform spatial subsetting.
Methods were added to supply the missing grid mapping via overrides in hoss_config.json, along with supporting methods.
Methods were also added to use the 2-D coordinate datasets to generate 1-D dimension scales that can be used to calculate
the index ranges needed to produce spatially subsetted output.
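
For illustration only, a minimal sketch of that idea (not the actual HOSS implementation; the function name and EPSG codes are assumptions): project one row and one column of the 2-D coordinate arrays to build the 1-D dimension scales.

import numpy as np
from pyproj import CRS, Transformer

def dimension_scales_from_2d_coordinates(lat_2d, lon_2d, grid_epsg=6933):
    """Build 1-D projected_y / projected_x scales from 2-D lat/lon arrays.

    Assumes a regular projected grid, with latitude varying down a column,
    longitude varying along a row, and no fill values in the sampled
    row/column. EPSG:6933 (global EASE-Grid 2.0) is only an example.
    """
    transformer = Transformer.from_crs(
        CRS.from_epsg(4326), CRS.from_epsg(grid_epsg), always_xy=True
    )
    x_values, _ = transformer.transform(lon_2d[0, :], lat_2d[0, :])
    _, y_values = transformer.transform(lon_2d[:, 0], lat_2d[:, 0])
    return np.asarray(y_values), np.asarray(x_values)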

Jira Issue ID

DAS-2232

Local Test Steps

  • Spatial subsetting can be tested without mask fill (until DAS-2215 is complete).
  • Run Harmony-in-a-Box with HOSS.
  • Comment out mask_fill in the services.yaml file.
  • Run a spatial subset request locally and ensure you get the correct subsetted output.
  • e.g.
  1. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(70%3A80)&subset=lon(-160%3A-150)&variable=Soil_Moisture_Retrieval_Data_AM%2Fstatic_water_body_fraction&skipPreview=true
  2. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(54%3A72)&subset=lon(2%3A42)&format=application%2Fx-netcdf4&variable=Soil_Moisture_Retrieval_Data_AM%2Falbedo%2CSoil_Moisture_Retrieval_Data_AM%2Fsurface_flag&skipPreview=true
  • 3-D variables do not work yet - pending DAS-2238.
  • New unit tests have not been added; the current unit tests are passing.
  • DAS-2236 was written to handle fill values in the corners.
  • Jupyter test notebooks exist for SMAP L3 and need to be updated.

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • [x] CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

@sudha-murthy marked this pull request as draft September 12, 2024 17:40
@sudha-murthy marked this pull request as ready for review September 13, 2024 05:09
@flamingbear (Member) left a comment

Sudha, there's a lot here to review and unfortunately I haven't worked with HOSS directly before. I tried to look through what I could and make decent suggestions.

But for sure, you need to test your changes. When you run the tests, a coverage report is generated. Before your changes the results were:

Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      156      2    99%
hoss/spatial.py                   61      0   100%
--------------------------------------------------
TOTAL                            668      8    99%

And after they dropped considerably:


Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      242     65    73%
hoss/spatial.py                   70      8    89%
--------------------------------------------------
TOTAL                            771     81    89%

It is very difficult to understand the changes when I can't look at a test and see what a function was supposed to do. Likewise, the function comments were not updated to describe the new functionality. Hopefully I'm not confusing anything with my lack of familiarity. I will defer to Owen if there are differences of opinion on what should be done.

The final test instructions assume a strong understanding of the problem you were solving, but I was eventually able to run the two test URLs and get output. Should the output files be geolocated somehow? They open in Panoply, but aren't geo-2d, just 2d.

@owenlittlejohns (Member) left a comment

Thanks @sudha-murthy for putting a lot of effort into this PR!

This review is not 100% thorough. I think there are a few bigger things here that need to be addressed, and that will make it easier to digest in full:

  • I think the new functionality to deal with overriding dimensions should not just go in the get_projected_x_y_ranges. By doing so, it's adding complexity to that function, which makes things harder to understand overall. I also think it is forcing you to do some things that would otherwise be unnecessary - for example, I don't think you need to write the variables to the NetCDF-4 file, you just need to ask a function for 1-D dimension variables and spit out 1-D numpy arrays.
  • I'm really worried about the use of set objects to contain overriding dimensions. The ordering of dimensions is important, and must match the underlying array axes. I suspect this is the reason you are having issues with 3-D variables (it's certainly part of the issue).
  • A lot of the additional code is written in ways that add high levels of cyclomatic complexity. I've suggested a few ways to cut that down, but it's worth taking a step back and working out what is common to code branches and what is truly different. Try to minimise what has to go in an if/elif/else, and then make sure each branch gives you the same output. The add_index_ranges function is perhaps the area of most concern.
  • There are a few places where the code branches and only assigns variables in one branch which are then used after the branching is finished. This will cause errors when something skips that block and later tries to access a non-existent variable.
  • This PR has no new or updated unit tests, either for new functions or new branches of code. I know I'm religious about these things, but they are absolutely needed, particularly on a PR of this scope. There are a bunch of changes, and they are complicated and hard to understand. Not only do tests make sure things work, reading them makes it easier to understand how the code flows.
  • There are a lot of places where code has changed significantly, but the documentation strings describing the functions have not. They should definitely be updated to keep in sync with what's going on.

Sorry for all the comments, but hopefully they are helpful!

@@ -55,6 +59,42 @@ def is_index_subset(message: Message) -> bool:
)


def get_override_projected_dimensions(
Member

This PR does not include any changes to unit tests. Each new function you are adding (or new branch of code within an existing conditional block) needs unit testing that hits every branch of the code within a function. I've not reviewed any of the code changes in detail yet, but this lack of tests is an absolute blocker for me to approve the PR.

Collaborator Author

Yes, Owen. Will add the unit tests.

if variable.dimensions == []:
    override_dimensions = get_override_dimensions(varinfo, [variable_name])
    if len(override_dimensions) > 0:
        for override in reversed(list(override_dimensions)):
@owenlittlejohns (Member) Sep 20, 2024

Why are you reversing things here? The ordering should be preserved. If the order of override dimensions does not reflect the ordering of array axes, then that needs to be fixed.

Collaborator Author

I will revisit that. I had a problem with fixing the override dimensions list.

Collaborator Author

Changes are in commit 756f7c0

@@ -422,22 +559,48 @@ def add_index_range(

"""
Member

For this function: READ THIS COMMENT FIRST:

I think this is really overcomplicated. Fundamentally, all this function is doing is getting a list of dimensions for a variable. Then it is looking in a cache to see if there is an index range for any of the dimension names, before using that cache value to tack something on the end of a string.

Maybe I'm being naive, but it feels like really all you need is something to determine the correct list of variable dimensions, then all the rest of the logic (looking in the cache and appending strings to the end of the variable name) is the same. That latter stuff 100% does not need to be duplicated in multiple condition branches. It's making this function unnecessarily hard to read.

The other, really important comment: I am super suspicious of the bit where you are needing to reverse the order of the dimensions list. However that is derived, it should be reliably in the ordering of the variable axes. I'm wary that what this means is that you have found that, for your particular use-case, the "random" ordering in a set is predictable for the pseudo-dimensions you have for SMAP L3, and you can coincidentally impose the order you need by reversing the set. I really think dimensions should not be passed around in sets, because you need that ordering. I strongly suspect this is the root cause of your issues with 3-D variables.
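
To make that concrete, a rough self-contained sketch of the suggested structure (the function and argument names here are hypothetical, not the PR's code): only the choice of dimension names lives in a branch, and the cache lookup and string building are shared.

def get_constraint_suffix(dimension_names, index_ranges):
    """Shared logic: look each dimension up in the cache and build '[...]'."""
    parts = []
    for name in dimension_names:
        cached = index_ranges.get(name)
        if cached is not None and cached[0] <= cached[1]:
            parts.append(f'[{cached[0]}:{cached[1]}]')
        else:
            parts.append('[]')
    return ''.join(parts)

def add_index_range_sketch(variable_name, real_dimensions, override_dimensions, index_ranges):
    """Only the selection of dimension names needs to be conditional."""
    dimensions = real_dimensions if real_dimensions else override_dimensions
    return f'{variable_name}{get_constraint_suffix(dimensions, index_ranges)}'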

Collaborator Author

The ordering and the shape are things I need to get from varinfo. I don't have that information. Updated DAS-2241 for this.

Member

Maybe rephrasing a different comment: I'm very uncomfortable with the thought of merging code with known problems into main. Going back to one of the things mentioned in the TRT breakout last week:

Teams ought to have a high-quality, maintainable baseline without known defects (particularly in new features) that is deployable to the production environment at all times.

Instead of making a massive PR that does 80% of a lot of stuff, this probably should be submitted piecemeal, with each piece being made 100% solid in a series of smaller PRs.

If you and David feel we're at a point that we can't break the changes down in that way, then a compromise might be to update this PR to merge into a feature branch. Then once all of the multiple PRs are merged into the feature branch, the feature branch can be merged into main.

Contributor

I think terminology has confused some of this discussion. The question of reversing is not one of reversing dimensions, but of the lat/lon variables coming from the Coordinates attribute. The recommendation here is that the get_coordinates method itself should return them in a standard order (reversed in this case), based upon the variable name - and not using reverse as shown here.

I don't think this is a case of "Known Issue".

We are planning to release a version for NSIDC to start testing with, which may not handle the 3D cases or other issues, but this release should not break any existing use cases. The unhandled cases should not truly break, but simply not be handled as desired. Incremental feature release is a tenet of agile development.

Collaborator Author

Changes are in commit 756f7c0

I have simplified the method. The order for SMAP is 'projected_y', 'projected_x'. The override section of the code is only used by SMAP at the moment. It can be generalized if we can get that order of dimensions from varinfo. I am not sure if the order of dimensions is used for other products currently handled by HOSS.

Collaborator Author

The use of sets for required_dimensions is based on how it is returned by varinfo and how it was originally in HOSS before my changes. The bounds update requires the dimensions to be a set; it fails for a list.

Member

There's a difference between a set of all dimensions for all required variables (as used in the prefetch function), which aggregates all Variable.dimensions attributes (which individually are lists), and the dimensions on an individual variable. Variable.dimensions is not a set for ordering purposes, it is a list.

With regards to bounds - we know that the pseudo-dimensions won't have a bounds attribute, so you might be better to not try to find bounds variables for them. Then you'll avoid any compatibility issues there.

Collaborator Author

The add_index_range function is a lot simpler now.

@owenlittlejohns (Member)

A couple of quick thoughts:

autydp yesterday
The review is a bit disjointed now with all the comments.

I agree that this PR is busy with a lot of comments. I'd recommend that @sudha-murthy picks a few of the comments in the PR to address at a time, pushes a commit up with them, and indicates in the comments that she has addressed them (with a link to the commit). Then the person making the comment can decide whether to mark it as resolved. Resolving the comments will make the PR look a lot less cluttered, and allow us to all focus on the remaining pieces.

flamingbear 4 hours ago
But since it looks like you've thumbed up a bunch of things, you can add a comment with the githash as a way to alert the commenter that you have addressed the issue, when you get there.

Yup - agreed.

Comment on lines 78 to 82
) -> bool:
    """
    returns the list of required variables without any
    dimensions
    """
Member

Two quick things here:

  1. The return type in the type hint is wrong: it should be set[str] not bool.
  2. It would probably be clearer in the documentation string to say "set of required variables" not "list of required variables". (I'm guessing you meant list in a generic sense, rather than the Python type, but it's a little ambiguous)



def get_variables_with_anonymous_dims(
    varinfo: VarInfoFromDmr, required_variables: set[str]
Member

Nitpick: It probably isn't necessary in this function to refer to the variables as required_variables. That is a piece of information outside of the scope of this function, maybe a better choice of name would be variable_names.

@owenlittlejohns (Member) left a comment

Thanks for adding some unit tests - I'll keep an eye out for more in future commits.

I resolved older comments I had left, which now look to be taken care of (thanks for that). There are still outstanding older items, and I've added some more comments on things that I looked at in more detail this time around. (To be honest, given the huge scale of this PR, it's hard to review it all in a single go, and so there are still bits I'm spotting only now, despite trying to go through the PR a couple of times)

Comment on lines 193 to 200
if lat_arr.ndim > 1:
    col_size = lat_arr.shape[0]
    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
    col_size = lat_arr.size
    row_size = lon_arr.size
Member

This is a bit wonky:

The middle check that the array sizes are equal explicitly checks the size of the 0th and 1st axes of the array. But after that you have a condition for whether the arrays only have one dimension. This means one of two things:

  1. Either the last bit, where lat_arr.ndim and lon_arr.ndim are both 1, will never be reached (because the arrays are always 2-D or higher).
  2. The coordinate arrays could be 1-D, and then the check for lon_arr.shape[1] != lat_arr.shape[1] will raise an exception, because the shape tuple doesn't have enough elements.

Collaborator Author

It is corrected.

    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
Member

This condition isn't doing what I believe you think it is.

My guess is you are trying to check if both lat_arr and lon_arr have 1 dimension. What's really happening is that you are checking if lat_arr.ndim has a "truthy" value (so isn't 0, since it's an integer), and then you are checking if lon_arr.ndim == 1. If it helps, the checks are really:

if (lat_arr.ndim) and (lon_arr.ndim == 1):

I think what you are trying to do is:

if lat_arr.ndim == 1 and lon_arr.ndim == 1:

Collaborator Author

Sorry about that. I think it was a typo. I corrected it.


"""
override_variable = varinfo.get_variable(variable_name)
projected_dimension_name = ''
Member

It would fit much better with the rest of the style of the code if you used else for these sorts of default values, instead of declaring them and then overriding them if a condition is met.

Yes, it's a personal preference, but it's consistent with the rest of the code (which is the more important thing here).

Collaborator Author

I think I had to declare it above as a default value to keep the variable in scope for the return call.

Member

I don't think you've understood my comment - you have two choices here: declare it first, or declare it in an else. I am asking you to be consistent to the rest of the style of the repository and put the default value in an else statement.

To avoid two else statements, you could restructure the logic below as:

if override_variable is not None and override_variable.is_latitude():
    projected_dimension_name = 'projected_y'
elif override_variable is not None and override_variable.is_longitude():
    projected_dimension_name = 'projected_x'
else:
    projected_dimension_name = ''

I still have two other concerns here:

  1. Returning an empty string seems weird.
  2. The projected dimension name should be distinct in the case of two grids. Using just projected_x and projected_y will cause clashes for those SMAP L3 products with both global and polar grids. (This must be resolved before merging this PR)

Collaborator Author

  • Corrected the names to add the group name (in next commit).
  • Simplified the structure to not individually check for lat/lon (in next commit).
  • About returning an empty string - technically it should never be called from our code unless we have the coordinates. I could throw an exception - it just seemed like overkill.

Comment on lines +191 to +192
row_size = 0
col_size = 0
Member

This is one example but there are quite a few places in this code that I think default values are being used instead of raising a genuine exception. If none of the conditions below are met, then it would be much better to catch the issue now and raise a user-friendly exception before later code tries to use these values and finds it can't do what it needs to (and raises some unintelligible exception to the end user).

I think the question to ask with a bunch of these functions is: if they fall back on the default values, can the rest of the code work using those return values? I think in this case (and a few others) the honest answer is no.
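
As a hypothetical illustration of the fail-fast alternative (the exception class and function name are made up, not HOSS code):

import numpy as np

class InvalidCoordinateData(Exception):
    """Placeholder for a user-friendly HOSS exception."""

def get_row_col_sizes(lat_arr: np.ndarray, lon_arr: np.ndarray) -> tuple:
    """Return (row_size, col_size), raising instead of using defaults of zero."""
    if lat_arr.ndim == 2 and lat_arr.shape == lon_arr.shape:
        return lat_arr.shape
    if lat_arr.ndim == 1 and lon_arr.ndim == 1:
        # Assumes latitude indexes rows and longitude indexes columns.
        return lat_arr.size, lon_arr.size
    raise InvalidCoordinateData(
        f'Cannot determine grid sizes from shapes {lat_arr.shape} and {lon_arr.shape}'
    )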

Comment on lines 215 to 216
lat_col = lat_arr[:, 0]
lon_row = lon_arr[0, :]
Member

This is making an assumption that latitude and longitude always represent the same axes in an array for all collections. We know that isn't true: for example, the GPM_3IMERGHH collection from GES DISC has things the wrong way around (time, lon, lat).

It would be better to take both arrays and check one row and one column of each to find which array is varying along each dimension.
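
A minimal sketch of that check, assuming 2-D coordinate arrays whose first row and column hold valid values (the function name is illustrative):

import numpy as np

def varies_along_rows(coordinate_2d: np.ndarray) -> bool:
    """True if values change across a row (i.e. along the column axis)."""
    variation_within_row = np.ptp(coordinate_2d[0, :])
    variation_within_column = np.ptp(coordinate_2d[:, 0])
    return variation_within_row > variation_within_column

# e.g. if varies_along_rows(lon_arr) is True, longitude corresponds to the
# column index and latitude to the row index, instead of assuming that order.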

Collaborator Author

I need to check both to get the same index values

Member

Your answer does not explain what will happen for collections when the ordering is reversed. I'm about to write a big old essay on the way to do this that will avoid any assumptions on latitude and longitude ordering (which I genuinely think will lead to problems).

Collaborator Author

The new method gets row and column on both lat/lon datasets. Will add a test for lat/lon with both types of ordering

Collaborator Author

updated - de350b6

Comment on lines +189 to +193
def test_get_x_y_index_ranges_from_coordinates(
    self,
    mock_get_x_y_extents,
    mock_get_dimension_index_range,
):
Member

This is a good test!

You definitely need either one more that requests SMAP L3 data for multiple grids, or this test could be updated to do so using a collection that has multiple grids.

Comment on lines 135 to 137
# lat_arr = prefetch_dataset[self.latitude][:]
# lon_arr = prefetch_dataset[self.longitude][:]

Member

These commented out lines should be removed.

Collaborator Author

removed - de350b6

lat_fill,
lon_fill,
)
for actual, expected in zip(actual_geo_corners, expected_geo_corners):
Member

A couple of things here:

  1. Why only self.assertAlmostEqual? You are defining the expected values, so you could define them to be exactly what you need.
  2. (Possibly a personal preference) zip makes things a little more complicated. Instead you could do something like:
for index, expected_corner in enumerate(expected_geo_corners):
    self.assertTupleEqual(actual_geo_corners[index], expected_corner)

If you stick with zip, could you maybe tweak the variables in the loop so they are actual_corner and expected_corner? Just to make it easier when reading this test back. Thanks!


"""

expected_result1 = np.array([0, 1, 2, 3, 4])
Member

Something of a theme here - there are a bunch of variable names in these tests that are really vague (expected_result1, expected, expected_result, similar things with actual). It would be clearer to the reader if the names were more specific. Something like expected_valid_indices.

Unit tests are a secondary form of documentation for developers, and can be really informative. It's really helpful (to me at least 😅) if they are as easy to read as possible.

Collaborator Author

done - de350b6

Member

Thanks for starting to add in the tests! A recommendation for next time would be to couple the writing of a function with the accompanying tests, instead of leaving all the tests to the end. (I tend to write tests for functions just after writing the code for the function itself, because that way it's still fresh in my head, but also when I move on from that function, I feel confident that it works when I call it elsewhere)

if variable.dimensions:
    range_strings = get_range_strings(variable.dimensions, index_ranges)
else:
    override_dimensions = get_override_projected_dimensions(varinfo, variable_name)
Member

@sudha-murthy - this is one of the things we discussed this morning, and I wanted to capture a comment on the PR so we don't lose it.

get_override_projected_dimensions only returns a list of fake spatial dimensions for the variable. However, for variables that have more dimensions (e.g., 3-D variables), this list is returning a 2-element list for a numpy array with 3 axes. This is causing the failure you reported with requests to OPeNDAP. When making requests to OPeNDAP each variable must follow one of two formats in the DAP4 constraint expression:

  • Not include any index information for the variable. E.g.: /variable_name.
  • Including index information for all dimensions. If you want the full range, you can just indicate that with []. So it's something like: /variable_name[][1:2][3:4]

Because the returned list of override dimensions only includes spatial information, this function is only constructing part of the list of index information (assuming the example from the second bullet point above is what you want, this function is actually just getting /variable_name[1:2][3:4]).

The answer here is to use a function that retrieves something for all axes on the variable, and in the correct order. (The naming of the non spatial pseudo-dimensions doesn't matter, because all we're doing is seeing if there is information in the cache of index ranges. If not, the [] will be applied for that dimension)

Member

This is the logic that is done for a variable with real dimensions:

for dimension in variable_dimensions:
    dimension_range = index_ranges.get(dimension)
    if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
        range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
    else:
        range_strings.append('[]')

You can see that if the dimension name is not in the cache of index_ranges, then [] is used.
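
As a concrete (made-up) example of that behaviour for a 3-D variable whose non-spatial dimension is not in the cache:

index_ranges = {'/projected_y': (5, 10), '/projected_x': (20, 40)}
variable_dimensions = ['/time', '/projected_y', '/projected_x']
# The loop above produces ['[]', '[5:10]', '[20:40]'], giving the DAP4
# constraint '/variable_name[][5:10][20:40]'.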

Collaborator Author

The override code uses that same method, but there is no 'dimension name' for the third dimension.

Member

Yes - it sounds like we're pretty much agreed on the problem here:

get_override_projected_dimensions only returns a list of fake spatial dimensions for the variable. However, for variables that have more dimensions (e.g., 3-D variables), this list is returning a 2-element list for a numpy array with 3 axes.

So the solution is probably to be retrieving the list of variable dimensions in a different way (specifically how is still being hotly debated in a number of places).

hoss/spatial.py
crs = get_variable_crs(non_spatial_variable, varinfo)
projected_x = 'projected_x'
projected_y = 'projected_y'
override_dimension_datasets = update_dimension_variables(
Member

Please can you be consistent with the rest of the code and use "variable" to refer to variables, and reserve "dataset" to refer to the netCDF4.Dataset object. Using "datasets" to mean variables will cause confusion in the future. (I acknowledge that "dataset" is also often used to mean "variable" by other software, but it is not so far used that way in the HOSS code, due to the terminology of the netCDF4 Python package)

Collaborator Author

done - 0666248

Member

Thanks for the update. I took a quick skim and there's one place remaining that uses dataset instead of variable. It's the get_valid_indices_in_dataset function. Otherwise, I think the language is now consistent.

Collaborator Author

The update_dimension_variables function is now called create_dimension_scales_from_coordinates, based on David's feedback (it will be in the new commits in the new branch).

The get_valid_indices_in_dataset function does get the valid indices from the numpy array passed in. Not sure if we can call it get_valid_indices_in_variable...

@owenlittlejohns (Member)

Okay, this is me trying to capture my thoughts after our meeting yesterday (and mulling it over a bit more since):

I think, in general, I am fine with the way in which the projected x and projected y index ranges are calculated. While I don't think the calculation of the corners is strictly necessary, as you just need any two elements (preferably more, but two is enough) in a single row/column to infer the resolution and extents of a grid dimension, once there are a full battery of unit tests up, it is a methodology that should successfully get the index ranges. For the sake of moving this monster PR along, I am fine to proceed with the corners methodology.

I think there are three bigger pieces here that I'm holding out for (along with outstanding smaller items about general coding best practices):

  1. Completing the unit test coverage of the new code. The percentage can sometimes be misleading, if you call a function that itself calls lower level functions. Those lower level functions will also need their own unit tests.

  2. Handling of multiple grids. Hard-coding the projected dimension names in the index_ranges cache to projected_x and projected_y will cause issues for multiple grids (which some SMAP L3 products have). The dimension index ranges for both will be written to the same keys (so the last one to be calculated will be the only one stored in the cache). The pseudo-dimension names need some uniqueness, likely based on the combination of coordinates from which they are derived.

  3. Relating the projected x and y pseudo-dimensions to the correct axes of the numpy arrays.

Why do we need this now? The current assumption that there is a specific ordering of these variables (['projected_y', 'projected_x']) is a problem, as we know that we have some collections that are (..., lat, lon), some that are (lat, lon, ...), and some that are (..., lon, lat). Simplifying this to the 2-D case (which is the minimum @autydp wants to solve in this PR), we need to account for two options:

  • A science variable with array shape (a, b), where the projected x dimension (also shape (a, b)) varies across a row and the projected y dimension (also shape (a, b)) varies along a column.
  • A science variable with array shape (a, b) where the projected x dimension (also shape (a, b)) varies along a column and the projected y dimension (also shape (a, b)) varies across a row.

(Note - I believe the solution below would account for both N-D science variables and also degenerate dimension sizes (a, a))

Likely this only matters in the retrieval of the index ranges in the add_index_range function, as this is where the string for a variable is constructed to send a request to OPeNDAP and it must order the index ranges correctly.

How can we do this:

  • First - it's important to remember that 2-D Python array indices are of order (row, column).
  • Next up, the one assumption I am going to make here is that the ordering of spatial dimensions in the coordinate variables (and therefore the projected x and y variables) matches the ordering of the same dimensions in the science variables that refer to them (there may be other dimensions padding them on either side, but the order of those two dimensions is assumed to be the same). This doesn't generally matter for grids where there are a different number of rows and columns, but will matter for things like the polar grids of SMAP L3, where there are the same number of rows and columns (something like (500, 500) if I remember rightly).
  • To determine the dimension that corresponds to the row indices, you just look at the values in one row of the projected x and projected y variables. One of them should vary in a regular manner, the other should be the same for every element. If projected x varies along the row, then projected x corresponds to your row indices.
  • To determine the dimensions that correspond to the column indices, you do the same thing, but checking in a single column (in the previous bullet point just do s/row/column/).
  • If your first row or column doesn't have at least 2 non-fill values, then move on to the next row/column until you hit one that has multiple non-fill values. (If none of the rows or none of the columns have enough non-fill values, an exception should have been raised much sooner in the code because index ranges should not have been able to be calculated).
  • From the bullet points above you should know which projected grid dimension corresponds to your row indices and which to your column indices. This means you should know the order to pull items out of the index_ranges cache.
  • Bonus: For 3-D variables, all the same logic applies, except for non-spatial dimensions you want to make sure that you just retrieve the [] string. In a general case of a science variable with (..., a, b, ...), I would suggest that you can take the size of the 2-D projected coordinate arrays to determine (a, b) and then just work out where in the shape of the science variable that sequence occurs (a sketch of that shape matching follows below this comment).

Just as a general point, the problem of dimension ordering (once you have the correct index ranges) in no way depends on what is a projected x coordinate and what is a projected y coordinate. It is entirely determined by which of those 2-D arrays varies along the a row and which varies up a column. Making an assumption that couples dimension ordering to which is x and which is y will inevitably break things for some future collection. (See GPM_3IMERGHH for an example where latitude and longitude are ordered opposite to how we would expect)
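
A small sketch of the shape-matching idea from the bonus bullet point (names are illustrative; it assumes the two spatial axes are adjacent in the science variable's shape):

def locate_spatial_axes(science_shape, coordinate_shape):
    """Find where the 2-D coordinate shape (a, b) occurs in an N-D shape."""
    for axis in range(len(science_shape) - 1):
        if tuple(science_shape[axis:axis + 2]) == tuple(coordinate_shape):
            return axis, axis + 1
    raise ValueError('Coordinate shape not found in science variable shape.')

# e.g. locate_spatial_axes((5, 406, 964), (406, 964)) returns (1, 2), so '[]'
# would be used for axis 0 and the cached index ranges for axes 1 and 2.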

@sudha-murthy (Collaborator Author) commented Oct 16, 2024

I think, in general, I am fine with the way in which the projected x and projected y index ranges are calculated. While I don't think the calculation of the corners is strictly necessary, as you just need any two elements (preferably more, but two is enough) in a single row/column to infer the resolution and extents of a grid dimension, once there are a full battery of unit tests up, it is a methodology that should successfully get the index ranges. For the sake of moving this monster PR along, I am fine to proceed with the corners methodology.

I have changed the corners method to getting just any two valid points; the version 8 granule had fill values in the lower right corner.
I don't have all the unit tests in the last commit - will add them in the next commit.
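
For reference, a minimal sketch of inferring a full 1-D dimension scale from any two valid points in a slice that may contain fill values (illustrative only, not the PR code):

import numpy as np

def dimension_scale_from_two_points(projected_values, fill_value):
    """Infer the resolution and extent of a regularly spaced grid dimension."""
    valid_indices = np.where(projected_values != fill_value)[0]
    if valid_indices.size < 2:
        raise ValueError('Need at least two non-fill values to infer the grid.')
    first, last = valid_indices[0], valid_indices[-1]
    resolution = (projected_values[last] - projected_values[first]) / (last - first)
    origin = projected_values[first] - first * resolution
    return origin + resolution * np.arange(projected_values.size)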

@sudha-murthy (Collaborator Author)

  • A science variable with array shape (a, b), where the projected x dimension (also shape (a, b)) varies across a row and the projected y dimension (also shape (a, b)) varies along a column.
  • A science variable with array shape (a, b) where the projected x dimension (also shape (a, b)) varies along a column and the projected y dimension (also shape (a, b)) varies across a row.

you mean latitude varies over row/col versus longitude varies over row/column right?
the projected_x and projected_y are 1D dimension scales

@sudha-murthy (Collaborator Author)

  • If your first row or column doesn't have at least 2 non-fill values, then move on to the next row/column until you hit one that has multiple non-fill values. (If none of the rows or none of the columns have enough non-fill values, an exception should have been raised much sooner in the code because index ranges should not have been able to be calculated).

I think my latest commit has some more checks to move to the next row or next column. It is not yet perfect since the same valid points have to be on the latitude and longitude grids

@sudha-murthy (Collaborator Author)

  1. Handling of multiple grids. Hard-coding the projected dimension names in the index_ranges cache to projected_x and projected_y will cause issues for multiple grids (which some SMAP L3 products have). The dimension index ranges for both will be written to the same keys (so the last one to be calculated will be the only one stored in the cache). The pseudo-dimension names need some uniqueness, likely based on the combination of coordinates from which they are derived.

Yes. Will make sure I add that test case and verify and correct the uniqueness issues.

@owenlittlejohns (Member)

you mean latitude varies over row/col versus longitude varies over row/column right?
the projected_x and projected_y are 1D dimension scales

I suppose I was still in the mindset of projecting the whole latitude and longitude arrays, which still feels simpler. Given the corners approach, that's fair - the projected_x and projected_y are 1-D.

That said, this indicates that the corners approach is losing the information we need to determine which projected spatial dimension corresponds to which dimension axis. I think this comment has actually made me realise the corners approach is insufficient to do what we need. We need the 2-D projected coordinate arrays (projected_x and projected_y) to know which is varying along a row and which is varying along a column.

@owenlittlejohns changed the base branch from main to DAS-2208-SMAP-L3 October 16, 2024 22:03