Tasks getting killed on Jasmin due to stratify being called from esmvalcore.preprocessor._regrid.extract_levels() preprocessor #3244
Comments
stratify is lazy since v0.3.0 and the You could try installing iris from source (clone the repository and run |
Okay, that's not the problem either! Just loading the data is enough for it to get killed!
Also results in Killed. It's not @bjlittle's stratify's fault. Just loading this data file breaks. |
That's because you're trying to load all the data into memory; maybe it doesn't fit? Try something like:
import iris
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)
print(cube)
# core_data() returns the lazy (dask) array if the data has not been loaded yet
print(cube.core_data()[:, 0, :, :]) |
See also ESMValGroup/ESMValCore#2114 |
This works, thanks! .... but this returns a dask array, which is not what I want. I just want to extract the surface layer of a cube, returning a cube (convert 4D -> 3D, or 3D -> 2D). |
Also, I should say that I've tried moving the preprocessor order around and I had the same problem with |
iris=3.6.1 is now available on conda-forge and it gets pulled in our environment, so if you can try regenerating the env and use it, see if that fixes your issue @ledm 🍺 |
Just to confirm my email @valeriupredoi, updating to iris=3.6.1 does not solve this issue. Method:
in ESMValCore:
Then in an interactive python script:
|
Okay, so more investigation: watching top while running the script at the start of this issue results in a huge spike in MEM usage. The file itself is only 2GB, but I've seen up to 8GB using top. Memory being several times larger than the file suggests a memory issue in iris/stratify. This is probably why re-ordering the preprocessors failed me earlier. I had assumed that if I extracted a smaller region first, then the surface layer, it would mean that less memory would be needed (this didn't work!). A memory leak means that it doesn't really matter how small a region you make it, as it will leak and break anyway. |
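A file's on-disk size can be well below its in-memory size (netCDF4 files are often compressed), so a quick back-of-envelope estimate is a useful sanity check before concluding there is a leak. The sketch below is only that: the 180 time steps follow from the `200001-201412` monthly range in the filename, but the 75 depth levels, the 330 x 360 grid, and the 4-byte float type are assumptions not stated in this thread and should be confirmed with `print(cube)`.

```python
from math import prod


def in_memory_bytes(shape, itemsize):
    """Bytes needed to hold an uncompressed array of the given shape."""
    return prod(shape) * itemsize


# 200001-201412 is 15 years of monthly data, i.e. 180 time steps.
# ASSUMPTION: 75 depth levels on a 330 x 360 grid, stored as 4-byte floats.
full_shape = (180, 75, 330, 360)
surface_shape = (180, 330, 360)

full = in_memory_bytes(full_shape, itemsize=4)
surface = in_memory_bytes(surface_shape, itemsize=4)

print(f"whole variable: {full / 2**30:.1f} GiB")
print(f"surface level only: {surface / 2**20:.0f} MiB")
```

If the assumed shape is roughly right, one fully realised copy of the variable is already several gigabytes, and any intermediate copies made during processing would multiply that.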
@ledm here's what I found out: the script you gave me, i.e.
import iris
from esmvalcore.preprocessor._regrid import extract_levels
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])
needs 13G of memory RES (resident) to run to completion; this with:
and the file in question is indeed 2GB but do remember that's a compressed netCDF4 file, usually with a 40% compression factor. So that means that |
the source of this problem is |
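A resident-memory figure like the one quoted above can also be captured from inside the script rather than by watching top. This is a minimal sketch using only the standard library; `peak_rss_mib` is a hypothetical helper name, the `bytearray` is a placeholder for the real workload (for example the `extract_levels` call), and the `resource` module is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / 2**20
    return peak / 2**10


before = peak_rss_mib()
payload = bytearray(50 * 2**20)  # placeholder for the real workload
after = peak_rss_mib()

print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

The value is a high-water mark for the whole process, so it never decreases; it shows how much memory a node must offer for the job to survive, not how much is in use at any given moment.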
Lol, if only it were that easy. This gets killed for me on sci1, sci3, sci4, sci6, and the LOTUS high-mem queue! |
sci2 did the trick for me. We now know where the problem lies, so fixing should follow 😁 |
Okay - running my original recipe (lol not fried chicken!) on sci2 now. Don't know if this is useful information, but it's trying to download 20GB of data from ESGF now. Not sure why it never got there before on sci1. (sci3 isn't connected to ESGF, I don't think) |
Okay, so reverted to ESMValTool 2.8, and iris 3.4. I'm still running out of memory, but at least it's breaking properly, instead of just getting killed:
Calling this a big W. |
Correction. This was on sci2. On sci3, it just got killed the normal way. No idea what's going on. Starting to think it's a JASMIN thing. Will try sci6 next. |
Continuing with this, here's a minimal testing recipe. On JASMIN.sci1 this fails for me. If I comment out either recipe, it runs fine. The fact that it works with one dataset but fails with two makes me think that perhaps something isn't being properly closed after it's finished? Or it's trying to run two things at once, even when |
The issue mentioned in the top post has been solved in ESMValGroup/ESMValCore#2120 which will be available in the upcoming v2.11.0 release of ESMValCore. I also investigated the recipe in #3244 (comment):
|
@ledm The recipe now runs with the ESMValCore |
With ESMValGroup/ESMValCore#2457 merged regridding is now automatically lazy for data with 2D lat/lon coordinates as well. |
On JASMIN, jobs are being killed when the following code runs:
This occurs with several versions of esmvalcore (2.8.0, 2.8.1, 2.9.0).
The error occurs for all four schemes and a range of level values (0.0, 0.1, 0.5)
In all cases, the error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870
Stratify (version '0.3.0') is a C/Python interface wrapper that has caused trouble before. It is not lazy, so it may try to load 120GB files into memory, among other issues. My previous solution to this problem was to write my own preprocessor:
ESMValGroup/ESMValCore#1039
ESMValGroup/ESMValCore#1048
Which has been abandoned, but I'm tempted to bring it back. (The deadline for this piece of work is 24th July!)
This is an extension of the discussion here: #3239