Recipe testing and output comparison for release 2.9.0 #3239
Overview of the results
Webpage: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0rc1/debug.html
Successful: 135 out of 153
Recipes that failed with DiagnosticError
Recipes that failed because of missing data
Recipes that failed because they used too much memory (probably due to SciTools/iris#5338)
|
And here are the results of the comparison with the results for v2.8.0. 61 recipes have different results, which seems like a lot. I will have a preliminary look at them and report some results here later. |
61 is not a lot - it's in fact bang-on, I think I had some 50-60 myself for 2.7 and Remi had a bucketload too; Matplotlib changed so expect plots differing only. Great work, bud 🍻 |
well, rainfarm died bc you didn't install Julia; where are @schlunma 's |
carvalhais looks like a proper bug in iris - |
Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either node failure or memory-related? Same with mpqb and sea surface salinity = gonna update your list there |
Great work with this recipe testing, @bouweandela!
Indeed, we have 7 recipes (last section of this #3239 (comment)) that couldn't be run due to time/memory limitations even though they could be run in the previous release. Possible reasons:
And 4-5 diagnostic failures that we should try to fix together next week 🍻 |
so @ledm has just reported to CEDA/JASMIN that a bunch of his runs get kicked out of SLURM on basis of resources but obv nothing out of the ordinary (ie mem consumption order 2GB) - he's still to confirm that he's using the latest Core, but the symptoms look identical to these recipes here that didn't exit, but rather, were kicked out by the scheduler |
lemme look into this now 🇵🇹 |
OK @ledm is running with latest:
and is seeing exactly what Bouwe saw for lauer22 etc - bizarre kickouts |
Okay, glad it's not just me. Though I think that some of my recent issues originated with #3240 and |
The recipe that gets killed immediately every time on JASMIN for me is: https://github.com/ESMValGroup/ESMValTool/blob/09c18280751323354b4f0aa1f0ef6e260589326d/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_po4.yml |
@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks |
I'm still struggling with this. I've changed the preprocessing stage of the recipe (removing custom order), and I've tried a few older versions of esmvalcore (2.8.0, 2.8.1, and 2.8.1rc1). No matter what, I'm still getting the "killed" message after 0.3GB of RAM usage. |
@ledm Would it be OK to open a separate issue for your recipe? This issue is about testing recipes for the upcoming v2.9 release. Re the recipes listed above that are running out of memory: this is probably due to SciTools/iris#5338. |
This comment was marked as off-topic.
I ran the comparison tool again because there was an issue with it. NaN values did not compare as equal in arrays, so the tool flagged too many recipe runs as changed. The new comparison results are available here. Now the number of recipe runs that are different from v2.8.0 is 61. I will add a more detailed comparison once it is ready. |
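The NaN problem described above is easy to reproduce. Here is a minimal sketch in plain Python, not the comparison tool's actual code; the function names are hypothetical and only illustrate why a naive element-wise check flags identical output as changed. For NumPy arrays, `np.array_equal(a, b, equal_nan=True)` gives the same NaN-aware behaviour.

```python
import math


def naive_equal(xs, ys):
    """Element-wise equality; NaN never equals NaN under IEEE 754."""
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))


def nan_aware_equal(xs, ys):
    """Element-wise equality that treats NaN in the same position as equal."""
    if len(xs) != len(ys):
        return False
    for x, y in zip(xs, ys):
        if math.isnan(x) and math.isnan(y):
            continue  # both missing: count as unchanged
        if x != y:
            return False
    return True


# Two runs with identical output that contains a missing value.
old = [1.0, float("nan"), 3.0]
new = [1.0, float("nan"), 3.0]

print(naive_equal(old, new))      # flagged as different
print(nan_aware_equal(old, new))  # recognised as identical
```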
Detailed comparison, including all kinds of information on what exactly changed in the output is available |
@bouweandela have you rerun this recipe to have it finish OK, by any chance? |
Just did with iris 3.6.1 and I'm still getting the same error. |
By increasing the resources (from interactive to compute partition, or from compute to compute 512G), I was able to successfully run a few more recipes: |
I reran some of these recipes with a distributed scheduler using the following cluster:
  type: distributed.LocalCluster
  n_workers: 8
  threads_per_worker: 4
  memory_limit: 32 GiB
(no idea how optimal those settings are).
The runs that failed exited with errors like
Bottom line: I think for the fully lazy preprocessors we can achieve a similar (hopefully much better with a smarter configuration) run time and memory usage when using dask distributed, but the non-lazy preprocessors crash with that. |
I tested |
carvalhais goes through with dask=2023.3.0 but fails a couple lines down the diag at plotting:
so cartopy is breaking it now |
the problem with downgrading dask is that we actually have to isolate a bug at their end, but beats me if I could do that - what I did was I saved the scalar cube that was causing the issue, and tried to repeat the operation that was not liked by dask:

import iris


def make_cube():
    """Load the previously saved scalar cube."""
    cube = iris.load_cube("tauxxx.nc")
    return cube


def main():
    """Run the show."""
    cube = make_cube()
    # Force lazy data: the cube in the diagnostic is lazy, but a scalar
    # cube loaded from file comes back with realized data.
    cube.data = cube.lazy_data()
    print(cube.has_lazy_data())
    print(cube)
    # The operation that fails inside the diagnostic.
    x = float(cube.core_data())
    print(x)


if __name__ == "__main__":
    main()

problem is, that works well, no issue. Note that I needed to force the lazy data bc the cube in the diag has indeed lazy data, whereas if you load an already scalar cube it has realized data. Am stumped 😖 |
this is signaled and raised here SciTools/cartopy#2199 and it looks like they already have a PR to fix it 🥳 |
Summary of relevant differences:
Recipes where the plots are identical, but (some) data differs
For the above recipes, I suspect we might want to see whether the check that the data is the same can be made more tolerant, since the plots are considered identical enough. The recipes below need a check by the @ESMValGroup/esmvaltool-recipe-maintainers (please let us know if you are no longer interested in being a maintainer, e.g. by commenting in this issue):
Recipes where (some) plots differ, but not the data
(Some) plots and data are different
The results from running the recipes with v2.8.0 of the tool are available here and the results from running them with v2.9.0 of the tool are available here. You can see very specifically what is different in the comparison tool output files: detailed comparison part 1 and detailed comparison part 2. |
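The more tolerant data check suggested above could look roughly like this. It is only a sketch: `data_close` is a hypothetical helper, not part of the comparison tool, and the tolerance values are assumptions that would need tuning against the recipes whose plots are identical.

```python
import math


def data_close(xs, ys, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if two sequences agree within a tolerance.

    NaN in the same position counts as unchanged. The default
    tolerances are illustrative assumptions, not tested values.
    """
    if len(xs) != len(ys):
        return False
    for x, y in zip(xs, ys):
        if math.isnan(x) and math.isnan(y):
            continue
        if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


# Output that differs only by floating point noise is accepted,
# while a real change in the data is still detected.
print(data_close([1.0, 2.0], [1.0000001, 2.0]))
print(data_close([1.0, 2.0], [1.1, 2.0]))
```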
@ESMValGroup/technical-lead-development-team Because we're seeing a considerable performance regression for some recipes when using the basic/default Dask scheduler, I would propose that we make the release with iris pinned to v3.4 or later instead of v3.6 or later. That would allow users who are really keen on using the basic scheduler to do so. Please let me know if there are objections to that. |
Out of interest, what was the last ESMValTool version that uses iris 3.4? |
For the release of v2.8.0, we used iris-3.4.1 to test all recipes |
@bouweandela thanks for the list! The plot from recipe_spei.yml only changed the colour (because R chooses different ones every time, I should set fixed ones there but I guess there is not enough time to change that before the release now). |
Thanks for checking @katjaweigel! I was planning to make the release on Thursday next week, so if you still want to fix the plotting in recipe_spei.yml before that time you can do it. |
Regarding the list of "Recipes that failed because they used too much memory" (in this #3239 (comment)), I was able to run them successfully on Levante, except one, by increasing the computational resources required (see #3216). This does not really tell us where the problems come from but will nevertheless help the RM, who will get fewer failures. The biggest increase in required resources is for the 2 cloud recipes. Using 1024 GB compute nodes and |
We did another round of test runs. The results are available here: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0/ Here is the environment.yml and here are the comparison results and the detailed comparison results. The comparison was done with the previous round of testing for this release. Because of disk space limitations, I had to remove the v2.9.0rc1 results from the webpage, but the results are still available in
All recipes ran successfully except for:
|
@bouweandela here's your Portuguese issue, bud #3281 |
I think we can safely close this now that we have #3287 - @bouweandela @remi-kazeroni since you put so much effort in it, I'd feel bad closing it myself, will you do the service pls 🎖️ |
Hi, I've been away on vacation and didn't see the mention to check differing plots before now. To me it looks like only consecutive dry days have changed, and I wondered if it's just that the version of consecutive dry days has changed from the one that caps the number at the end of a year to the one that runs over multiple years when the longest dry period extends across calendar years. That would explain the plot discrepancy and the weird end of the original plot (in 2.8), and would anyway be the more sensible thing to have plotted this way. I.e. I think that this is an improvement, then. |
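To illustrate the two variants of consecutive dry days described above, here is a small sketch. It is not the diagnostic's implementation: the function names and the toy input (a list of `(year, is_dry)` pairs, one per day) are assumptions made for the example.

```python
def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    longest = current = 0
    for dry in flags:
        current = current + 1 if dry else 0
        longest = max(longest, current)
    return longest


def dry_spell_capped(days):
    """Longest dry spell per year; a spell is cut at each year boundary."""
    by_year = {}
    for year, dry in days:
        by_year.setdefault(year, []).append(dry)
    return {year: longest_run(flags) for year, flags in by_year.items()}


def dry_spell_continuous(days):
    """Longest dry spell overall; a spell may span calendar years."""
    return longest_run(dry for _, dry in days)


# Toy series: 3 dry days at the end of 2000 followed by 4 at the start of 2001.
days = (
    [(2000, False)]
    + [(2000, True)] * 3
    + [(2001, True)] * 4
    + [(2001, False)]
)

print(dry_spell_capped(days))      # the 7-day spell is split into 3 and 4
print(dry_spell_continuous(days))  # the spell is counted as 7 days
```

With the capped variant a dry period that crosses New Year is reported as two shorter spells, which is consistent with the odd end of the v2.8 plot mentioned above.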
This issue documents recipe testing for the v2.9.0 release
Installation and environment
Environment file
environment.txt
Config user file
All options at the default for Levante except
search_esgf: when_missing