Runner version 2.320.0 suddenly killing memory heavy workflow (exit code 137) #3521

sroet · 2024-10-22T16:19:34Z

Describe the bug
Since our self-hosted runner updated to version 2.320.0, a memory-heavy workflow (which still uses less than 10% of the memory available to its container) exits with code 137 (which seems to indicate an external kill signal). A unit test workflow for the same code that uses less memory still works fine on the same runner.

To Reproduce
Steps to reproduce the behavior:

Go to https://github.com/SBC-Utrecht/pytom-match-pick/actions/workflows/tutorial-tests.yml and scroll down to the successful job 163 and the failing job 164 (both were scheduled runs without any update to our code).

Expected behavior
I expected my workflow not to be killed when it uses ~10% of the available memory.

Runner Version and Platform

2.320.0

OS of the machine running the runner? Linux

What's not working?

A weekly heavy test of our code started failing on October 13th after passing on October 6th.

No pushes were made to that branch in that time.

I first assumed it was an updated dependency, so I tried to track it down in SBC-Utrecht/pytom-match-pick#234.
However, even with an identical dependency set, the workflow still fails with:

Error: Process completed with exit code 137.
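
For reference, here is a minimal sketch of how an exit code like 137 is usually read. It assumes the common shell convention that a status above 128 means "128 plus the signal number", and that the code runs on Linux, where signal 9 is SIGKILL. The function name describe_exit_code is made up for this illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a process exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "an unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

This only says which signal ended the process, not who sent it, so it cannot by itself distinguish the kernel OOM killer from any other source of SIGKILL.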

Googling that error suggests that the Docker container receives a kill signal, but I find no OOM message in my syslog, and monitoring the job with docker stats never shows memory usage above 10%
(while the unit tests of the same code with the same dependencies, which use less memory, still work).
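
The two checks described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact procedure used here: the regular expression covers kernel OOM wording I would expect to see ("Out of memory", "oom-killer", "oom-kill:"), which may differ between kernel versions, and the second helper assumes you saved the JSON printed by docker inspect for the container while it still existed. The helper names and the sample inputs are made up.

```python
import json
import re

# Assumed kernel OOM wording; adjust for the kernel version in use.
OOM_PATTERN = re.compile(r"oom-killer|oom-kill:|out of memory", re.IGNORECASE)


def oom_lines(log_lines):
    """Return the log lines that look like kernel OOM-killer messages."""
    return [line for line in log_lines if OOM_PATTERN.search(line)]


def container_oom_state(inspect_json):
    """Extract (OOMKilled, ExitCode) from the JSON printed by `docker inspect`."""
    state = json.loads(inspect_json)[0]["State"]
    return state.get("OOMKilled"), state.get("ExitCode")


# Made-up sample inputs, for illustration only (not taken from the real logs).
sample_syslog = [
    "kernel: Out of memory: Killed process 4242 (python)",
    "systemd[1]: Started an unrelated service.",
]
sample_inspect = '[{"State": {"OOMKilled": false, "ExitCode": 137}}]'

print(oom_lines(sample_syslog))
print(container_oom_state(sample_inspect))
```

If OOMKilled is false while the exit code is 137, that is consistent with what I see here: the process was killed, but apparently not by the memory limit of the container.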

The one other difference I could find between the two scheduled jobs was the update of the GitHub runner to version 2.320.0 (look for the test-tutorial job; test_suite is the unit test job):

2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded
2024-10-06 02:26:09Z: Running job: test_suite
2024-10-06 02:29:23Z: Job test_suite completed with result: Succeeded
2024-10-07 16:06:11Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 16:07:20Z: Runner reconnected.
Runner update in progress, do not shutdown runner.
Downloading 2.320.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update
Update finished successfully.
Restarting runner...

√ Connected to GitHub

Current runner version: '2.320.0'
2024-10-07 20:45:48Z: Listening for Jobs
2024-10-07 21:51:31Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 21:52:48Z: Runner reconnected.
2024-10-08 07:12:04Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 07:13:23Z: Runner reconnected.
2024-10-08 15:46:21Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 15:47:32Z: Runner reconnected.
2024-10-09 13:25:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-09 13:26:52Z: Runner reconnected.
2024-10-11 21:07:23Z: Runner connect error: The HTTP request timed out after 00:01:40.. Retrying until reconnected.
2024-10-13 00:44:26Z: Running job: test-tutorial
2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed
2024-10-13 01:41:49Z: Running job: test_suite
2024-10-13 01:45:03Z: Job test_suite completed with result: Succeeded
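
To make the comparison in the log above easier to repeat, here is a small sketch that pairs each job result with the runner version last announced in the log. It assumes the two line formats seen above ("Job ... completed with result: ..." and "Current runner version: '...'"); the function name job_results_by_version is made up, and the version is None for jobs that appear before any version line.

```python
import re

JOB_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z: "
    r"Job (?P<job>\S+) completed with result: (?P<result>\w+)$"
)
VERSION_RE = re.compile(r"^Current runner version: '(?P<version>[^']+)'$")


def job_results_by_version(lines):
    """Return (timestamp, job, result, runner_version) tuples in log order."""
    version = None  # unknown until the log prints a version line
    rows = []
    for raw in lines:
        line = raw.strip()
        match = VERSION_RE.match(line)
        if match:
            version = match.group("version")
            continue
        match = JOB_RE.match(line)
        if match:
            rows.append(
                (match.group("ts"), match.group("job"), match.group("result"), version)
            )
    return rows


# Three lines copied from the log excerpt above.
sample = [
    "2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded",
    "Current runner version: '2.320.0'",
    "2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed",
]
for row in job_results_by_version(sample):
    print(row)
```

This only shows that the failure followed the update in time; it does not show that the update caused it.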

Job Log Output

Error: Process completed with exit code 137.

Runner and Worker's Diagnostic Logs

Working scheduled job worker and runner:
Worker_20241006-004441-utc.log
Runner_20240815-204548-utc.log

Failing scheduled job worker and runner:
Worker_20241013-004427-utc.log
Runner_20241007-204547-utc.log


sroet · 2024-10-22T16:20:32Z

xref: SBC-Utrecht/pytom-match-pick#232

sroet · 2024-10-31T16:26:31Z

The last run suddenly worked again, so I am closing this for now.

sroet added the bug label Oct 22, 2024

sroet closed this as completed Oct 31, 2024
