
normalize-data add_alpha step is failing #802

Closed
edsu opened this issue Feb 13, 2024 · 7 comments · May be fixed by #799

edsu (Contributor) commented Feb 13, 2024

As you can see in druid:kg552qb0295, the normalize-data step is throwing an error (see the related Honeybadger alert):

Error: normalize-data : normalize-data: could not execute command successfully: false: /usr/bin/gdalwarp -co 'COMPRESS=LZW' -dstalpha /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif 

It appears that the input file given to gdalwarp doesn't exist; perhaps the temporary directory had already been cleaned up?

To replicate the problem:

  1. register a new GIS item in Argo
  2. ssh to lyberadmin@kurma-robots-stage-01
  3. `cp -r /gis_workflow_data/stage/kg552qb0295/ /gis_workflow_data/stage/{your-new-druid}`
  4. Use Argo to add the gisAssessionWF workflow
  5. Wait for the error in the workflow.

Note: this may be related to #658 since the item includes a COG GeoTIFF?

@edsu edsu converted this from a draft issue Feb 13, 2024
@jmartin-sul jmartin-sul self-assigned this Feb 14, 2024
@jmartin-sul jmartin-sul moved this from Ready to In Progress in Geo Workcycles 2024 Feb 14, 2024
@jmartin-sul (Member)

re-running normalize-data on kg552qb0295 on stage since main has been re-deployed, in an attempt to see whether this still happens when #799 isn't in play.

@jmartin-sul (Member)

normalize-data succeeded for https://argo-stage.stanford.edu/view/druid:kg552qb0295 with main deployed. so that's interesting. a copy of the file that ran into the gdalwarp error when compression was enabled (/var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif) should be available for further testing at /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif on kurma-robots-stage-01.

before copying that input file from the failed gdalwarp invocation, i made sure it was done writing (by confirming that the output file was being written, and that the timestamp of the input file was no longer changing, and that the md5 matched for my copy and the original temp file a few minutes after the copy was made).
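
a rough sketch of that verification (the exact commands here are my reconstruction, not a log from the session):

# confirm the input file's timestamp was no longer changing before copying
stat -c '%y' /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif
# a few minutes after making the copy, confirm the checksums match
md5sum /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif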

will do some testing with gdalwarp on that saved copy.

the successful normalize-data execution in my test just now took 7921 seconds (about 132 minutes).

jmartin-sul (Member) commented Feb 15, 2024

In doing some manual testing yesterday on the copy I made of the file that ran into this issue, I saw the following output from gdalwarp before it quit:

lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ /usr/bin/gdalwarp -co 'COMPRESS=LZW' -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif
...stuff i couldn't see because i did this in a screen session and it scrolled off...
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
...lots more of that...
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
More than 1000 errors or warnings have been reported. No more will be reported from now.

The _alpha.tif output file was exactly 4 GB when gdalwarp quit:

lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ ls -lah /var/geomdtk/current/tmp
total 6.7G
drwxr-xr-x 2 lyberadmin lyberteam 4.0K Feb 14 18:20 .
drwxr-xr-x 5 lyberadmin lyberteam 4.0K Jul 29  2022 ..
-rw-r--r-- 1 lyberadmin lyberteam 4.0G Feb 14 18:35 jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif
-rw-r--r-- 1 lyberadmin lyberteam 2.7G Feb 14 09:57 jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif
lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ 

A random Stack Overflow post suggests that -co BIGTIFF=YES might fix the issue. I'll look at the GDAL docs to see how to specify two creation options, since -co is also how we specify compression.

A seemingly relevant doc suggests using the -co flag multiple times -- https://gdal.org/drivers/raster/gtiff.html, e.g.:

gdal_translate in.jpg out.tif -co COMPRESS=JPEG -co TILED=YES
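
Applied to our failing invocation, that would presumably look something like this (untested as of this writing; in.tif and out_alpha.tif are placeholder paths):

# repeat -co to pass both creation options to gdalwarp
/usr/bin/gdalwarp -co 'COMPRESS=LZW' -co 'BIGTIFF=YES' -dstalpha in.tif out_alpha.tif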

A few other possibly useful links:
https://gis.stackexchange.com/questions/307452/error-while-creating-large-tiff-with-gdal-merge
https://gis.stackexchange.com/questions/402954/use-bigtiff-yes-creation-option-not-working
https://gdal.org/programs/gdalwarp.html#cmdoption-gdalwarp-co

Also worth noting: I monitored system resource usage with htop while my test gdalwarp invocation was running. CPU usage was reasonable (one CPU pretty fully consumed for most of the test) and RAM usage was also reasonable -- it went from about 1.1 GB when idle to between 1.4 and 1.6 GB while gdalwarp was running. @edsu and I thought yesterday that the system might be under-resourced, and that we might want to bump it up from its current 5.8 GB of RAM, but that doesn't seem to be the crux of this bug. We might still want to up the resources, since 5.8 GB is pretty low? OTOH, the gis-robot-suite boxes don't run a ton of workers (each appears to run a single single-threaded Sidekiq process, based on a quick glance at stage and prod robot-console).

edsu (Contributor, Author) commented Feb 16, 2024

This all sounds good -- thanks for the testing! I wonder if it will work with the additional option? I guess it'll take even longer lol :-)

jmartin-sul (Member) commented Feb 16, 2024

Using the change in 17b104e, I was able to retry the normalize-data step on https://argo-stage.stanford.edu/view/druid:kx853bc3354 and see it successfully complete. Or at least, complete without erroring out. We discussed in standup this morning, and @kimdurante will verify that the alpha channel add actually worked as desired. Bonus, if we indeed got the desired output: the slightly more circuitous revised approach is significantly faster than using gdalwarp alone.

gory detail about what motivated this new approach...

I did more manual testing with gdalwarp etc on jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif on the CLI yesterday, and this is what I found...

First, I tried: /usr/bin/gdalwarp -co 'COMPRESS=LZW' -co 'BIGTIFF=YES' -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-compress-bigtiff.tif

Unfortunately I wasn't tracking timing closely, but at what felt like about an hour into that operation, the output file with the alpha channel was already 47 GB, the progress meter indicated it wasn't yet 50% done, and the result was clearly unlikely to fit in the available space (62 GB!). So I cancelled execution and deleted the output.

Then, for good measure, I retried the original command (the one that succeeded on kg552qb0295): /usr/bin/gdalwarp -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif. That is, no compression and no forcing of BIGTIFF. This again completed successfully; as it had seemed when I watched the robot do it via Sidekiq, the operation took just shy of 90 min and produced a 13 GB tif (the docs indicate that the default is BIGTIFF=IF_NEEDED, so it makes sense that it was able to generate a 13 GB file without an explicit BigTIFF flag). Also of note: memory usage never seemed to climb past about 1.5 or 1.6 GB (idle memory usage on the VM hovers right around 1 GB).

Given the above, I thought I'd see if anyone else had run into the surprising behavior of much larger files being generated when the compress option is given to gdalwarp... and indeed, it's a known thing. This Stack Exchange post led me to the new approach: use gdalwarp to create a VRT with the alpha channel, then call gdal_translate to create a new compressed TIFF with the alpha channel: https://gis.stackexchange.com/questions/89444/file-size-inflation-normal-with-gdalwarp (see also https://trac.osgeo.org/gdal/wiki/UserDocs/GdalWarp#GeoTIFFoutput-coCOMPRESSisbroken)

I settled on piping the output of one command into the other using GDAL's virtual file system, e.g. (from manual testing yesterday): gdalwarp -dstalpha -of vrt /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /vsistdout/ | gdal_translate -co compress=LZW /vsistdin/ /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-via-vrt-piped.tif
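
For reference, the non-piped variant I also tested (an explicit intermediate VRT file instead of /vsistdout/ and /vsistdin/; the paths here are placeholders) would look something like:

# write the alpha-channel result as a lightweight VRT
gdalwarp -dstalpha -of vrt input.tif /tmp/input_alpha.vrt
# then produce the compressed GeoTIFF from that VRT
gdal_translate -co COMPRESS=LZW /tmp/input_alpha.vrt output_alpha.tif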

The piped run's new TIFF is 3.1 GB. The gdalwarp -dstalpha -of vrt step takes about a second, and the gdal_translate -co compress=LZW step takes about 3 minutes. Some of the other normalize-data operations are expensive too, so the step as a whole still takes a while. But when I applied this change to the robot code, redeployed, and re-ran normalize-data on the remaining errored object, this change alone got the total time for the step down from 7921 seconds to 2564 seconds. Hopefully it actually produces the desired output!

@jmartin-sul (Member)

one other observation about memory usage: both vrt-based approaches (explicit output file for the vrt vs piping) seem to stay entirely under 1.4 GB of memory consumption.

given that there are at most 3 robot threads likely to be running at any given time, and given that these normalization operations seem to consume about 500 MB of memory each over the 1 GB baseline usage (i.e., roughly 1 GB + 3 × 0.5 GB ≈ 2.5 GB at peak), i'd guess that 5.8 GB total should be fine for now? i.e., no need to up the RAM on the gis-robot-suite VMs just yet, i don't think.

edsu (Contributor, Author) commented Feb 20, 2024

Unfortunately, the two-step gdalwarp | gdal_translate approach outlined above for adding the alpha channel didn't seem to actually add an alpha channel.

[Screenshot 2024-02-20 at 3:35 PM]
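
One quick way to check for the alpha band (a sketch of how this could be verified; not necessarily how it was checked here):

# list band color interpretations; a band with ColorInterp=Alpha is what we'd expect
gdalinfo /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-via-vrt-piped.tif | grep -i colorinterp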

During a huddle today (@jmartin-sul, @kimdurante, @edsu and @aaron-collier) we decided to back out the addition of the alpha channel since it is not having the desired effect. This effectively closes this ticket, at least for now.

@edsu edsu closed this as completed Feb 20, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Geo Workcycles 2024 Feb 20, 2024
jmartin-sul added a commit that referenced this issue Feb 21, 2024
…gdalwarp

ref #789
ref #802

uses a more circuitous way of adding alpha channel (gdalwarp to add alpha channel and output to vrt, then pipe that to gdal_translate for compression).  this provides better compression and faster computation than solely using gdalwarp.
jmartin-sul added further commits referencing this issue (Feb 21, Feb 26, Feb 27, Mar 20, and Mar 26, 2024), each with the same message as above.