[JENKINS-72163] Retry on initial connection failure occurs in one entrypoint but not the other #675

Merged: 1 commit merged into jenkinsci:master on Oct 12, 2023

Conversation

@basil (Member) commented on Oct 11, 2023

Context

See JENKINS-72163.

Problem

I talked to a user running Kubernetes agents in a cluster where the controller was not immediately reachable over the network after the agent spun up; it took 30 seconds or so for the controller to become reachable. While admitting this networking setup was not ideal, the user expected Remoting to be resilient to this scenario, but it was not. Instead, Remoting printed the following exception and then terminated with a non-zero exit code, never trying again:

java.io.IOException: Failed to connect to http://example.com/tcpSlaveAgentListener/: Connection refused
	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:216)
	at hudson.remoting.Engine.innerRun(Engine.java:761)
	at hudson.remoting.Engine.run(Engine.java:543)
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:547)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:178)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:533)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:638)
	at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:281)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:386)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:408)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1309)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1057)
	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:213)
	... 2 more

Evaluation

There are two public static void main() entrypoints into Remoting: hudson.remoting.Launcher (used by java -jar remoting.jar -jnlpUrl <…>) and hudson.remoting.jnlp.Main (used by java -cp remoting.jar hudson.remoting.jnlp.Main <…> -url <…>), which was the entrypoint this user was using. The first entrypoint is a thin wrapper around the second when -jnlpUrl is passed in: if the controller is not available, it keeps retrying every 10 seconds (unless -noReconnect is specified) until the controller is available before vectoring into the second entrypoint. If the connection is interrupted after it is established and the controller is not immediately available for reconnection, we again retry every 10 seconds (unless -noReconnect is specified). But there is a gap in retry coverage: if the second entrypoint is invoked directly (rather than via the first entrypoint) and the controller is not available at the time the initial connection is attempted, as was the case with this user, no retries occur.
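
For orientation, the first entrypoint's pre-connection polling behaves roughly as in the following minimal sketch. This is illustrative only, not the actual hudson.remoting.Launcher code; the class name, the probe against the JNLP URL, and the elided hand-off are assumptions made for the example.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of the first entrypoint's pre-connection retry loop
// (illustrative only; not the actual hudson.remoting.Launcher code).
public class LauncherRetrySketch {
    public static void main(String[] args) throws Exception {
        String jnlpUrl = args[0]; // e.g. the -jnlpUrl argument
        boolean noReconnect = args.length > 1 && "-noReconnect".equals(args[1]);
        while (true) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(jnlpUrl).openConnection();
                conn.connect();       // throws ConnectException while the controller is unreachable
                conn.disconnect();
                break;                // controller reachable: vector into the second entrypoint
            } catch (IOException e) {
                if (noReconnect) {
                    throw e;          // fail fast when -noReconnect is given
                }
                Thread.sleep(10_000); // otherwise retry every 10 seconds
            }
        }
        // hudson.remoting.jnlp.Main.main(translatedArgs); // hand-off elided in this sketch
    }
}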

Solution

When -noReconnect is not specified, sleep for the usual interval and loop back around to retry rather than terminating. This is consistent with the behavior when running via the first entrypoint, which does the same before vectoring into the second (at which point the controller should already be reachable). It is also consistent with the behavior after an existing connection is interrupted and the controller is not immediately available for reconnection. When -noReconnect is specified, preserve the existing behavior of either terminating fatally under normal scenarios or terminating non-fatally when running under Swarm (to allow Swarm to perform its own exponential backoff instead).
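
In sketch form, the fixed control flow for the initial resolution looks roughly like this. It is not the actual Engine.innerRun code: EndpointResolver, Events, and resolveWithRetry are stand-ins for the Remoting internals, and the property string is simply the expansion of Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions".

import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the fixed control flow (not the actual Engine.innerRun code).
public class InitialConnectRetrySketch {
    interface EndpointResolver { String resolve() throws IOException; }
    interface Events { void status(String msg, Throwable t); void error(Throwable t); }

    static String resolveWithRetry(EndpointResolver resolver, Events events, boolean noReconnect)
            throws InterruptedException {
        while (true) {
            try {
                return resolver.resolve(); // fails while the controller is unreachable
            } catch (IOException e) {
                if (noReconnect) {
                    if (Boolean.getBoolean("hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions")) {
                        events.status("Could not resolve JNLP agent endpoint", e); // non-fatal: Swarm does its own backoff
                    } else {
                        events.error(e); // fatal: caller exits with a non-zero code
                    }
                    return null;
                }
                events.status("Could not resolve JNLP agent endpoint, retrying in 10 seconds", e);
                TimeUnit.SECONDS.sleep(10); // same interval as the existing retry paths
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Events log = new Events() {
            public void status(String msg, Throwable t) { System.out.println(msg + ": " + t); }
            public void error(Throwable t) { t.printStackTrace(); }
        };
        // Simulate an unreachable controller with -noReconnect: gives up after one attempt.
        resolveWithRetry(() -> { throw new IOException("Connection refused"); }, log, true);
    }
}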

Implementation

We have made no attempt to clean up the rather complex control flow and duplicate code in this module. We did, however, leave notes that might aid a future refactoring effort.

Notes to Reviewers

Please review with the Hide Whitespace feature enabled.

Testing Done

Reproduced the problem by shutting down the controller and attempting to connect via the second entrypoint directly. Before this PR, Remoting printed the exception and terminated fatally; after this PR, it slept in 10-second intervals until the controller was back online, at which point the connection succeeded. Also verified that the existing behavior, including the fatal exit code, was preserved when running with -noReconnect.

@basil added the backporting-candidate and bug (For changelog: Fixes a bug.) labels on Oct 11, 2023

-    } catch (Exception e) {
+    } catch (IOException e) {
         if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
             events.status("Could not resolve JNLP agent endpoint", e);
@basil (Member Author) commented:

Minor cleanup while I was here. The original code was catching any exception type, including InterruptedException (which it made no attempt to handle). Following generic advice such as that given on this page, prefer specific exceptions in catch blocks.
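
To illustrate that advice (this is not the Remoting change itself; the class and method names here are made up for the example), catching the specific checked exception keeps unrelated exceptions such as InterruptedException out of the handler:

import java.io.IOException;

// Illustration of preferring a specific exception type in a catch block
// (made-up example; not the actual Remoting code).
public class CatchSpecificSketch {
    static String resolveEndpoint() throws IOException, InterruptedException {
        throw new IOException("Connection refused"); // placeholder for the real network call
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            resolveEndpoint();
        } catch (IOException e) { // specific: only the expected failure mode is handled here
            System.err.println("Could not resolve JNLP agent endpoint: " + e);
        }
        // With catch (Exception e), an InterruptedException would also land in the handler
        // and the interrupt would be silently lost unless explicitly re-asserted.
    }
}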

@timja (Member) left a comment:

A hardcoded 10-second sleep doesn't seem ideal, but all the pre-existing code does it, so 👍


And thanks for the follow-up PR at #676 that fixes the hardcoded sleep issue.

@MarkEWaite (Contributor) left a comment:

Thanks!

@basil merged commit d76b9dd into jenkinsci:master on Oct 12, 2023
12 checks passed
@basil deleted the retry branch on October 12, 2023 at 15:34
Comment on lines +769 to +770
if (Boolean.getBoolean(Engine.class.getName() + ".nonFatalJnlpAgentEndpointResolutionExceptions")) {
events.status("Could not resolve JNLP agent endpoint", e);
A member commented:

Note to readers other than the author (not a review): this property derives from #449, where SwarmClient wraps hudson.remoting.jnlp.Main in-JVM and thus the call to System.exit would be “fatal”. For normal uses of Remoting, the distinction is merely between logging a stack trace at SEVERE and exiting with -1 versus logging a stack trace at INFO and exiting with 0 (but still effectively treating the error as fatal).
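
To make the distinction concrete, here is a minimal sketch of why an in-JVM wrapper needs the non-fatal mode. It is illustrative only, not the actual SwarmClient or Remoting code, and the names are invented.

// Minimal sketch of why an in-JVM wrapper needs a non-fatal failure mode
// (invented example; not the actual SwarmClient or Remoting code).
public class InJvmWrapperSketch {
    static void connectOnce(boolean nonFatal) {
        boolean resolved = false; // pretend the controller is unreachable
        if (!resolved) {
            if (nonFatal) {
                System.out.println("Endpoint resolution failed; returning to caller"); // INFO-style, exit code 0
                return; // the wrapper stays alive and can apply its own backoff
            }
            System.err.println("Endpoint resolution failed; terminating"); // SEVERE-style
            System.exit(-1); // fatal: would also kill the JVM that wraps this entrypoint
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A Swarm-like wrapper calling the entrypoint in-process:
        for (int attempt = 0; attempt < 3; attempt++) {
            connectOnce(true);               // non-fatal mode keeps this loop alive
            Thread.sleep(1_000L << attempt); // wrapper-side exponential backoff
        }
    }
}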

@basil (Member Author) replied:

thus the call to System.exit would be “fatal”

Not sure why scare quotes were used here, but yes, a call to System.exit is fatal to the life of the process.
