[JENKINS-72163] Retry on initial connection failure occurs in one entrypoint but not the other #675
Conversation
```diff
-        } catch (Exception e) {
+        } catch (IOException e) {
             if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
                 events.status("Could not resolve JNLP agent endpoint", e);
```
Minor cleanup while I was here. The original code was catching any exception type, including `InterruptedException` (which it made no attempt to handle). Following generic advice such as that given on this page, prefer specific exceptions in catch blocks.
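As a minimal, self-contained illustration of that advice (hypothetical names, not the actual Remoting code), catching the specific exceptions a block can throw avoids silently swallowing interruption:

```java
import java.io.IOException;

class EndpointResolutionExample {
    // Hypothetical stand-ins for the real Remoting types; these are assumptions for
    // illustration only, not the code touched by this PR.
    interface Events { void status(String msg, Throwable t); }

    static String resolveEndpoint() throws IOException {
        throw new IOException("controller not reachable yet");
    }

    static void run(Events events) {
        try {
            String endpoint = resolveEndpoint();
            Thread.sleep(10_000L);                    // may throw InterruptedException
            System.out.println("Resolved " + endpoint);
        } catch (IOException e) {
            // The specific failure we expect and know how to report.
            events.status("Could not resolve JNLP agent endpoint", e);
        } catch (InterruptedException e) {
            // Do not swallow interruption: restore the flag and stop.
            Thread.currentThread().interrupt();
        }
    }
}
```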
The hardcoded 10-second sleep doesn't seem ideal, but all the pre-existing code does it, so 👍
And thanks for the follow-up PR at #676 that fixes the hardcoded sleep issue.
Thanks!
```java
if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
    events.status("Could not resolve JNLP agent endpoint", e);
```
Note to readers excluding the author (not a review): this property derives from #449, where SwarmClient wraps `hudson.remoting.jnlp.Main` in-JVM and thus the call to `System.exit` would be “fatal”. For normal uses of Remoting, the distinction is merely between logging a stack trace at `SEVERE` and exiting with -1 vs. logging a stack trace at `INFO` and exiting with 0 (but still effectively treating the error as fatal).
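For readers unfamiliar with that property, here is a rough sketch of the distinction being described (hypothetical class and logging, not Remoting's actual code; only the property name comes from the diff above):

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class NonFatalFlagSketch {
    private static final Logger LOGGER = Logger.getLogger(NonFatalFlagSketch.class.getName());

    static void handleResolutionFailure(IOException e) {
        // Assumes Engine is hudson.remoting.Engine, so the expanded property name would be
        // hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions.
        boolean nonFatal = Boolean.getBoolean(
                "hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions");
        if (nonFatal) {
            // Non-fatal path: log at INFO and exit 0, still effectively giving up.
            LOGGER.log(Level.INFO, "Could not resolve JNLP agent endpoint", e);
            System.exit(0);
        } else {
            // Default path: log at SEVERE and exit -1.
            LOGGER.log(Level.SEVERE, "Could not resolve JNLP agent endpoint", e);
            System.exit(-1);
        }
    }
}
```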
> thus the call to System.exit would be “fatal”

Not sure why scare quotes were used here, but yes, a call to `System.exit` is fatal to the life of the process.
Context
See JENKINS-72163.
Problem
I talked to a user running Kubernetes agents in a cluster where the controller was not immediately reachable over the network after the agent spun up; it took 30 seconds or so for the controller to become reachable. While admitting this networking setup was not ideal, the user expected Remoting to be resilient to this scenario, but it was not: Remoting printed the following exception and then terminated with a non-zero exit code, never trying again:
Evaluation
There are two `public static void main()` entrypoints into Remoting: `hudson.remoting.Launcher` (used by `java -jar remoting.jar -jnlpUrl <…>`) and `hudson.remoting.jnlp.Main` (used by `java -cp remoting.jar hudson.remoting.jnlp.Main <…> -url <…>`), which was the entrypoint being used by this user. The first entrypoint is a thin wrapper around the second when `-jnlpUrl` is passed in, and if the controller is not available it keeps retrying every 10 seconds (unless `-noReconnect` is specified) until the controller is available before vectoring into the second entrypoint. If the connection is interrupted after it is established and the controller is not immediately available for reconnection, we again retry every 10 seconds (unless `-noReconnect` is specified). But there is a gap in retry coverage: if the second entrypoint is invoked directly (rather than via the first entrypoint) and the controller is not available at the time the initial connection is made, as was the case with this user, no retries will occur.
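To make the gap concrete, here is a conceptual sketch (not Launcher's actual implementation) of what the first entrypoint effectively does before vectoring into the second; the polling helper and method names are hypothetical, and invoking `hudson.remoting.jnlp.Main` requires remoting.jar on the classpath:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class FirstEntrypointSketch {
    // Conceptual sketch only: poll the -jnlpUrl until the controller answers, then
    // hand off to the second entrypoint. Names and structure are assumptions.
    static void runWithJnlpUrl(URL jnlpUrl, boolean noReconnect, String[] agentArgs) throws Exception {
        while (true) {
            try {
                HttpURLConnection conn = (HttpURLConnection) jnlpUrl.openConnection();
                conn.connect();                 // is the controller reachable yet?
                conn.disconnect();
                break;                          // yes: stop polling
            } catch (IOException e) {
                if (noReconnect) {
                    throw e;                    // -noReconnect: fail fast
                }
                Thread.sleep(10_000L);          // otherwise retry every 10 seconds
            }
        }
        // Vector into the second entrypoint, which previously assumed the controller
        // was already reachable.
        hudson.remoting.jnlp.Main.main(agentArgs);
    }
}
```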
Solution
When `-noReconnect` is not specified, sleep the usual interval and loop back around to retry rather than terminating. This is consistent with the behavior when running via the first entrypoint, which does the same thing before vectoring into the second (at which point the controller should already be reachable). It is also consistent with the behavior that occurs after an existing connection is interrupted and the controller is not immediately available for reconnection. When `-noReconnect` is specified, preserve the existing behavior of either terminating fatally under normal scenarios or terminating non-fatally when running under Swarm (to allow Swarm to use its own exponential backoff implementation instead).
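A rough sketch of the described behavior (hypothetical helper and names, not the actual patch):

```java
import java.io.IOException;

class RetryOnInitialFailureSketch {
    // Sketch of the described fix: on an initial connection failure, sleep the usual
    // interval and loop back around unless -noReconnect was given.
    static void connectWithRetry(boolean noReconnect) throws InterruptedException {
        while (true) {
            try {
                openConnectionToController();   // hypothetical stand-in for the real handshake
                return;                         // connected; proceed as before
            } catch (IOException e) {
                if (noReconnect) {
                    // Preserve the pre-existing behavior: report the failure and terminate
                    // (fatally, or non-fatally when running under Swarm).
                    System.err.println("Failed to connect: " + e);
                    System.exit(-1);
                }
                System.err.println("Could not connect; retrying in 10 seconds: " + e);
                Thread.sleep(10_000L);          // same hardcoded interval used elsewhere (see #676)
            }
        }
    }

    static void openConnectionToController() throws IOException {
        throw new IOException("controller not reachable yet");   // placeholder failure
    }
}
```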
Implementation
We have made no attempt to clean up the rather complex control flow and duplicate code in this module. We did, however, leave notes that might aid a future refactoring effort.
Notes to Reviewers
Please review with the Hide Whitespace feature enabled.
Testing Done
Reproduced the problem by shutting down the controller and attempting to connect via the second entrypoint directly. Before this PR, I received the exception and a fatal termination; after this PR, Remoting slept in 10-second intervals until the controller was back online, at which point the connection succeeded. Also verified that the existing behavior, including the fatal exit code, was preserved when running with `-noReconnect`.