Describe the bug
Since our self-hosted runner updated to version 2.320.0, a memory-heavy job (which nonetheless uses less than 10% of the memory available to our container) dies with exit code 137, which seems to indicate an external kill signal. A unit test workflow for the same code that uses less memory still works fine on the same runner.
Expected behavior
I expected my workflow not to be killed when it uses only ~10% of the available memory.
Runner Version and Platform
2.320.0
OS of the machine running the runner? Linux
What's not working?
A weekly heavy test of our code started failing on October 13th after passing on October 6th.
No commits were pushed to that branch during that period.
I first assumed an updated dependency was responsible, so I tried to track it down in SBC-Utrecht/pytom-match-pick#234.
However, even with an identical dependency set, the workflow still fails with:
Error: Process completed with exit code 137.
From searching for that error, it seems the process in the Docker container receives a kill signal (exit code 137 is 128 + 9, i.e. SIGKILL). However, I find no OOM-killer entry in my syslog, and monitoring the job with docker stats never shows memory usage above 10%.
Meanwhile, the unit tests of the same code with the same dependencies, which use less memory, still pass.
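For anyone triaging a similar failure, here is a minimal sketch of how the exit code and Docker's own OOM flag could be cross-checked. It assumes the job container still exists, so that the output of `docker inspect <container>` can be captured and passed in as text. The function names `decode_exit_code` and `summarize_inspect` are hypothetical helpers, not part of the runner or of Docker.

```python
import json
import signal


def decode_exit_code(code: int) -> str:
    """Describe a shell-style exit code.

    By shell convention, a code above 128 means the process was
    terminated by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


def summarize_inspect(inspect_json: str) -> dict:
    """Extract the exit code and OOM flag from `docker inspect` output.

    `docker inspect` prints a JSON array with one object per container;
    the "State" object carries "ExitCode" and "OOMKilled".
    """
    state = json.loads(inspect_json)[0]["State"]
    return {
        "exit_code": state["ExitCode"],
        "oom_killed": state["OOMKilled"],
        "cause": decode_exit_code(state["ExitCode"]),
    }


if __name__ == "__main__":
    print(decode_exit_code(137))
```

If `oom_killed` is false while the cause is SIGKILL, the kill likely came from something other than the container's memory limit, for example another process sending the signal. That would be consistent with the missing OOM entry in syslog, but it does not prove anything on its own.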
The only other difference I could find between the two scheduled runs is the runner update to version 2.320.0. In the runner log below, test-tutorial is the failing job and test_suite is the unit test job:
2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded
2024-10-06 02:26:09Z: Running job: test_suite
2024-10-06 02:29:23Z: Job test_suite completed with result: Succeeded
2024-10-07 16:06:11Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 16:07:20Z: Runner reconnected.
Runner update in progress, do not shutdown runner.
Downloading 2.320.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update
Update finished successfully.
Restarting runner...
√ Connected to GitHub
Current runner version: '2.320.0'
2024-10-07 20:45:48Z: Listening for Jobs
2024-10-07 21:51:31Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 21:52:48Z: Runner reconnected.
2024-10-08 07:12:04Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 07:13:23Z: Runner reconnected.
2024-10-08 15:46:21Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 15:47:32Z: Runner reconnected.
2024-10-09 13:25:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-09 13:26:52Z: Runner reconnected.
2024-10-11 21:07:23Z: Runner connect error: The HTTP request timed out after 00:01:40.. Retrying until reconnected.
2024-10-13 00:44:26Z: Running job: test-tutorial
2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed
2024-10-13 01:41:49Z: Running job: test_suite
2024-10-13 01:45:03Z: Job test_suite completed with result: Succeeded
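To make the correlation in the log above easier to see, a small sketch like the following could pair each completed job with the runner version that was active at the time. It assumes the log format shown in this excerpt ("Job ... completed with result: ..." and "Current runner version: '...'" lines). `jobs_with_versions` is a hypothetical helper, and the version stays `None` until the log reports one.

```python
import re

# Patterns based on the runner log lines quoted in this issue.
JOB_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z): "
    r"Job (\S+) completed with result: (\w+)"
)
VERSION_RE = re.compile(r"^Current runner version: '([^']+)'")


def jobs_with_versions(lines):
    """Return (timestamp, job, result, runner_version) for each finished job."""
    version = None  # unknown until the log prints a version line
    finished = []
    for raw in lines:
        line = raw.strip()
        version_match = VERSION_RE.match(line)
        if version_match:
            version = version_match.group(1)
            continue
        job_match = JOB_RE.match(line)
        if job_match:
            timestamp, job, result = job_match.groups()
            finished.append((timestamp, job, result, version))
    return finished


if __name__ == "__main__":
    excerpt = [
        "2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded",
        "Current runner version: '2.320.0'",
        "2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed",
    ]
    for row in jobs_with_versions(excerpt):
        print(row)
```

This only lines up job results with version changes. It cannot show that the update caused the failure.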
Runner and Worker's Diagnostic Logs
Working scheduled job worker and runner:
Worker_20241006-004441-utc.log
Runner_20240815-204548-utc.log
Failing scheduled job worker and runner:
Worker_20241013-004427-utc.log
Runner_20241007-204547-utc.log