
normalize-data add_alpha step is failing #802

Closed
edsu opened this issue Feb 13, 2024 · 7 comments · May be fixed by #799

edsu (Contributor) commented Feb 13, 2024

As you can see in druid:kg552qb0295, the normalize-data step is throwing an error (see the related Honeybadger alert):

Error: normalize-data : normalize-data: could not execute command successfully: false: /usr/bin/gdalwarp -co 'COMPRESS=LZW' -dstalpha /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif 

It appears that the input file given to gdalwarp doesn't exist; perhaps the temporary directory had already been cleaned up?

To replicate the problem:

  1. register a new GIS item in Argo
  2. ssh to lyberadmin@kurma-robots-stage-01
  3. `cp -r /gis_workflow_data/stage/kg552qb0295/ /gis_workflow_data/stage/{your-new-druid}`
  4. Use Argo to add the gisAssessionWF workflow
  5. Wait for the error in the workflow.

Note: this may be related to #658 since the item includes a COG GeoTIFF?

@edsu edsu converted this from a draft issue Feb 13, 2024
@jmartin-sul jmartin-sul self-assigned this Feb 14, 2024
@jmartin-sul jmartin-sul moved this from Ready to In Progress in Geo Workcycles 2024 Feb 14, 2024
@jmartin-sul (Member)

re-running normalize-data on kg552qb0295 on stage since main has been re-deployed, in an attempt to see whether this still happens when #799 isn't in play.

@jmartin-sul (Member)

normalize-data succeeded for https://argo-stage.stanford.edu/view/druid:kg552qb0295 with main deployed. so that's interesting. a copy of the file that ran into the gdalwarp error when compression was enabled (/var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif) should be available for further testing at /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif on kurma-robots-stage-01.

before copying that input file from the failed gdalwarp invocation, i made sure it was done writing (by confirming that the output file was being written, and that the timestamp of the input file was no longer changing, and that the md5 matched for my copy and the original temp file a few minutes after the copy was made).
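
a rough sketch of that verification (the exact commands here are my reconstruction, not a log from the session):

# confirm the input file's timestamp was no longer changing before copying
stat -c '%y' /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif
# a few minutes after making the copy, confirm the checksums match
md5sum /var/geomdtk/current/tmp/normalize_kg552qb0295/EPSG_4326/20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif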

will do some testing with gdalwarp on that saved copy.

the successful normalize-data execution in my test just now took 7921 seconds (about 132 minutes).

jmartin-sul (Member) commented Feb 15, 2024

In doing some manual testing yesterday on the copy I made of the file that ran into this issue, I saw the following output from gdalwarp before it quit:

lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ /usr/bin/gdalwarp -co 'COMPRESS=LZW' -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif
...stuff i couldn't see because i did this in a screen session and it scrolled off...
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
...lots more of that...
ERROR 1: TIFFAppendToStrip:Maximum TIFF file size exceeded. Use BIGTIFF=YES creation option.
More than 1000 errors or warnings have been reported. No more will be reported from now.

The _alpha.tif output file was exactly 4 GB when gdalwarp quit:

lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ ls -lah /var/geomdtk/current/tmp
total 6.7G
drwxr-xr-x 2 lyberadmin lyberteam 4.0K Feb 14 18:20 .
drwxr-xr-x 5 lyberadmin lyberteam 4.0K Jul 29  2022 ..
-rw-r--r-- 1 lyberadmin lyberteam 4.0G Feb 14 18:35 jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif
-rw-r--r-- 1 lyberadmin lyberteam 2.7G Feb 14 09:57 jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif
lyberadmin@kurma-robots-stage-01:~/gis-robot-suite/current$ 

A random Stack Overflow post suggests that -co BIGTIFF=YES might fix the issue. I'll look at the GDAL docs to see how to specify two creation options, since -co is also how we specify compression.

A seemingly relevant doc suggests using the -co flag multiple times -- https://gdal.org/drivers/raster/gtiff.html, e.g.:

gdal_translate in.jpg out.tif -co COMPRESS=JPEG -co TILED=YES
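
Applied to our failing invocation, that would presumably look something like this (untested as of this writing; in.tif and out_alpha.tif are placeholder paths):

# repeat -co to pass both creation options to gdalwarp
/usr/bin/gdalwarp -co 'COMPRESS=LZW' -co 'BIGTIFF=YES' -dstalpha in.tif out_alpha.tif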

A few other possibly useful links:
https://gis.stackexchange.com/questions/307452/error-while-creating-large-tiff-with-gdal-merge
https://gis.stackexchange.com/questions/402954/use-bigtiff-yes-creation-option-not-working
https://gdal.org/programs/gdalwarp.html#cmdoption-gdalwarp-co

Also worth noting: I monitored system resource usage with htop while my test gdalwarp invocation was running. CPU usage was reasonable (one CPU pretty fully consumed for most of the test) and RAM usage was also reasonable -- it went from about 1.1 GB when idle to between 1.4 and 1.6 GB while gdalwarp was running. @edsu and I thought yesterday that the system might be under-resourced, and that we might want to bump it up from its current 5.8 GB of RAM, but that doesn't seem to be the crux of this bug. We might still want to up the resources, since 5.8 GB is pretty low? OTOH, the gis-robot-suite boxes don't run a ton of workers (each appears to run a single single-threaded Sidekiq process, based on a quick glance at stage and prod robot-console).

edsu (Contributor, Author) commented Feb 16, 2024

This all sounds good -- thanks for the testing! I wonder if it will work with the additional option? I guess it'll take even longer lol :-)

jmartin-sul (Member) commented Feb 16, 2024

Using the change in 17b104e, I was able to retry the normalize-data step on https://argo-stage.stanford.edu/view/druid:kx853bc3354 and see it successfully complete. Or at least, complete without erroring out. We discussed in standup this morning, and @kimdurante will verify that the alpha channel add actually worked as desired. Bonus, if we indeed got the desired output: the slightly more circuitous revised approach is significantly faster than using gdalwarp alone.

gory detail about what motivated this new approach...

I did more manual testing with gdalwarp etc on jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif on the CLI yesterday, and this is what I found...

First, I tried: /usr/bin/gdalwarp -co 'COMPRESS=LZW' -co 'BIGTIFF=YES' -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-compress-bigtiff.tif

Unfortunately I wasn't tracking timing closely, but at what felt like about an hour into that operation, the output file with the alpha channel was already 47 GB, the progress meter indicated it wasn't yet 50% done, and the result was clearly unlikely to fit in the available space (62 GB!). So I cancelled execution and deleted the output.

Then, for good measure, I retried the original command (the one that succeeded on kg552qb0295): /usr/bin/gdalwarp -dstalpha /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha.tif. That is, no compression and no forcing of BIGTIFF. This again completed successfully; as it had seemed when I watched the robot do it via Sidekiq, the operation took just shy of 90 min and produced a 13 GB tif (the docs indicate that the default is BIGTIFF=IF_NEEDED, so it makes sense that it was able to generate a 13 GB file without an explicit BigTIFF flag). Also of note: memory usage never seemed to climb past about 1.5 or 1.6 GB (idle memory usage on the VM hovers right around 1 GB).

Given the above, I thought I'd see if anyone else had run into the surprising behavior of much larger files being generated when the compress option is given to gdalwarp... and indeed, it's a known thing. This Stack Exchange post led me to the new approach: use gdalwarp to create a VRT with the alpha channel, then call gdal_translate to create a new compressed TIFF with the alpha channel: https://gis.stackexchange.com/questions/89444/file-size-inflation-normal-with-gdalwarp (see also https://trac.osgeo.org/gdal/wiki/UserDocs/GdalWarp#GeoTIFFoutput-coCOMPRESSisbroken)

I settled on piping the output of one command into the other using GDAL's virtual file system, e.g. (from manual testing yesterday): gdalwarp -dstalpha -of vrt /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format.tif /vsistdout/ | gdal_translate -co compress=LZW /vsistdin/ /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-via-vrt-piped.tif
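
For reference, the non-piped variant I also tested (an explicit intermediate VRT file instead of /vsistdout/ and /vsistdin/; the paths here are placeholders) would look something like:

# write the alpha-channel result as a lightweight VRT
gdalwarp -dstalpha -of vrt input.tif /tmp/input_alpha.vrt
# then produce the compressed GeoTIFF from that VRT
gdal_translate -co COMPRESS=LZW /tmp/input_alpha.vrt output_alpha.tif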

The piped run's new TIFF is 3.1 GB. The gdalwarp -dstalpha -of vrt step takes about a second, and the gdal_translate -co compress=LZW step takes about 3 minutes. Some of the other normalize-data operations are expensive too, so the step as a whole still takes a while. But when I applied this change to the robot code, redeployed, and re-ran normalize-data on the remaining errored object, this change alone got the total time for the step down from 7921 seconds to 2564 seconds. Hopefully it actually produces the desired output!

@jmartin-sul (Member)

one other observation about memory usage: both vrt-based approaches (explicit output file for the vrt vs piping) seem to stay entirely under 1.4 GB of memory consumption.

given that there are at most 3 robot threads likely to be running at any given time, and given that these normalization operations seem to consume about 500 MB of memory each over the 1 GB baseline usage (i.e., roughly 1 GB + 3 × 0.5 GB ≈ 2.5 GB at peak), i'd guess that 5.8 GB total should be fine for now? i.e., no need to up the RAM on the gis-robot-suite VMs just yet, i don't think.

edsu (Contributor, Author) commented Feb 20, 2024

Unfortunately, the two-step gdalwarp | gdal_translate approach outlined above for adding the alpha channel didn't seem to actually add an alpha channel.

[Screenshot 2024-02-20 at 3:35 PM]
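
One quick way to check for the alpha band (a sketch of how this could be verified; not necessarily how it was checked here):

# list band color interpretations; a band with ColorInterp=Alpha is what we'd expect
gdalinfo /var/geomdtk/current/tmp/jm_20220626_215939_ssc6_u0002_pansharpened_file_format_alpha-via-vrt-piped.tif | grep -i colorinterp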

During a huddle today (@jmartin-sul, @kimdurante, @edsu and @aaron-collier) we decided to back out the addition of the alpha channel since it is not having the desired effect. This effectively closes this ticket, at least for now.

@edsu edsu closed this as completed Feb 20, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Geo Workcycles 2024 Feb 20, 2024
jmartin-sul added a commit that referenced this issue Feb 21, 2024
…gdalwarp

ref #789
ref #802

uses a more circuitous way of adding alpha channel (gdalwarp to add alpha channel and output to vrt, then pipe that to gdal_translate for compression).  this provides better compression and faster computation than solely using gdalwarp.
jmartin-sul added further commits referencing this issue (Feb 21, Feb 26, Feb 27, Mar 20, and Mar 26, 2024), each with the same message as above.