Runner version 2.320.0 suddenly killing memory heavy workflow (exit code 137) #3521

Closed
sroet opened this issue Oct 22, 2024 · 2 comments
Labels
bug Something isn't working


sroet commented Oct 22, 2024

Describe the bug
Since our self-hosted runner updated to version 2.320.0, a memory-heavy container (which nevertheless uses less than 10% of the memory available to our container) dies with exit code 137 (which seems to indicate an external kill signal). A unit test workflow of the same code that uses less memory still works fine on the same runner.
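For reference, a minimal sketch of how an exit code such as 137 can be decoded. It assumes the usual POSIX shell convention that a process killed by signal N is reported as 128 + N; the helper name `describe_exit_code` is hypothetical and not part of the runner or Docker.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    # Shells (and Docker) report "killed by signal N" as 128 + N.
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "an unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

On Linux, 137 - 128 = 9 is SIGKILL, which cannot be caught by the process, so the code alone does not say who sent it (the kernel OOM killer, Docker, or another process).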

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://github.com/SBC-Utrecht/pytom-match-pick/actions/workflows/tutorial-tests.yml and scroll down to the successful job 163 and the failing job 164 (both were scheduled runs, without any change to our code in between).

Expected behavior
I expected my workflow not to be killed when it uses only ~10% of the available memory.

Runner Version and Platform

2.320.0

OS of the machine running the runner? Linux

What's not working?

A weekly heavy test of our code started failing on October 13th after passing on October 6th.
image
No pushes were done on that branch in that time:
image

I first assumed it was an updated dependency, so I tried to track it down in SBC-Utrecht/pytom-match-pick#234.
However, even with an identical dependency set, the workflow still fails with:

Error: Process completed with exit code 137.

Googling that error suggests that the Docker container receives a kill signal, but I find no OOM message in my syslog, and tracking the job with docker stats never shows memory usage above 10% (while the unit tests of the same code with the same dependencies, which use less memory, still work).
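As a complement to syslog and docker stats, the container state recorded by Docker can be checked. The sketch below reads the `State.ExitCode` and `State.OOMKilled` fields from `docker inspect` output. It assumes the Docker CLI is installed and that the job container still exists when inspected (the runner may remove it after the job); the function names and the sample JSON are hypothetical illustrations, not data from this runner.

```python
import json
import subprocess


def summarize_state(inspect_json: str) -> dict:
    """Pull the exit code and OOM flag out of `docker inspect` JSON output."""
    # `docker inspect` prints a JSON array with one object per container.
    state = json.loads(inspect_json)[0]["State"]
    return {"exit_code": state["ExitCode"], "oom_killed": state["OOMKilled"]}


def inspect_container(container: str) -> dict:
    """Run `docker inspect` on a container name or ID (needs the Docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize_state(result.stdout)


# Hypothetical sample shaped like `docker inspect` output, for illustration only.
sample = '[{"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": false}}]'
print(summarize_state(sample))
```

An exit code of 137 together with `OOMKilled` being false would be consistent with the missing OOM entries in syslog and would suggest that something other than the container's memory limit sent the kill signal, though it does not identify the sender.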

The one other thing I could find that differed between the two scheduled jobs was the update of the GitHub runner to version 2.320.0 (look for the test-tutorial job; test_suite is the unit test job):

2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded
2024-10-06 02:26:09Z: Running job: test_suite
2024-10-06 02:29:23Z: Job test_suite completed with result: Succeeded
2024-10-07 16:06:11Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 16:07:20Z: Runner reconnected.
Runner update in progress, do not shutdown runner.
Downloading 2.320.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update
Update finished successfully.
Restarting runner...

√ Connected to GitHub

Current runner version: '2.320.0'
2024-10-07 20:45:48Z: Listening for Jobs
2024-10-07 21:51:31Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-07 21:52:48Z: Runner reconnected.
2024-10-08 07:12:04Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 07:13:23Z: Runner reconnected.
2024-10-08 15:46:21Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-08 15:47:32Z: Runner reconnected.
2024-10-09 13:25:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2024-10-09 13:26:52Z: Runner reconnected.
2024-10-11 21:07:23Z: Runner connect error: The HTTP request timed out after 00:01:40.. Retrying until reconnected.
2024-10-13 00:44:26Z: Running job: test-tutorial
2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed
2024-10-13 01:41:49Z: Running job: test_suite
2024-10-13 01:45:03Z: Job test_suite completed with result: Succeeded
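The comparison made by eye in the log above (which job results came before and after the update) can be sketched as a small parser. This is only an illustration that assumes the two line formats seen in this excerpt; the regular expressions and the function name `job_results` are hypothetical, and the sample is a shortened copy of the lines above.

```python
import re

# Matches lines such as:
# "2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed"
JOB_RESULT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}Z): "
    r"Job (?P<job>\S+) completed with result: (?P<result>\w+)$"
)
# Matches lines such as: "Current runner version: '2.320.0'"
VERSION = re.compile(r"^Current runner version: '(?P<version>[^']+)'$")


def job_results(lines):
    """Return (timestamp, job, result, runner_version) for each finished job.

    The version is None until a "Current runner version" line has been seen.
    """
    version = None
    results = []
    for raw in lines:
        line = raw.strip()
        match = VERSION.match(line)
        if match:
            version = match.group("version")
            continue
        match = JOB_RESULT.match(line)
        if match:
            results.append(
                (match.group("ts"), match.group("job"), match.group("result"), version)
            )
    return results


sample_log = """\
2024-10-06 02:26:07Z: Job test-tutorial completed with result: Succeeded
2024-10-06 02:29:23Z: Job test_suite completed with result: Succeeded
Current runner version: '2.320.0'
2024-10-13 00:47:24Z: Job test-tutorial completed with result: Failed
2024-10-13 01:45:03Z: Job test_suite completed with result: Succeeded
"""

for entry in job_results(sample_log.splitlines()):
    print(entry)
```

Grouping results by runner version this way only shows a correlation in time between the update and the failure; it does not by itself show that the update caused it.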

Job Log Output

Error: Process completed with exit code 137.

Runner and Worker's Diagnostic Logs

Working scheduled job worker and runner:
Worker_20241006-004441-utc.log
Runner_20240815-204548-utc.log

Failing scheduled job worker and runner:
Worker_20241013-004427-utc.log
Runner_20241007-204547-utc.log

sroet added the bug label Oct 22, 2024

sroet commented Oct 22, 2024

xref: SBC-Utrecht/pytom-match-pick#232


sroet commented Oct 31, 2024

The last run suddenly worked again, so I am closing this for now.

sroet closed this as completed Oct 31, 2024