-
Before filing a bug I'd like to discuss the situation. I have made a little demo app which reproduces the issue:
https://github.com/ashumkin/tests-nats-fetch-batch-timeout/blob/8747438a37a0ee0eab1151c4aec068fa18ecb2ef/main.go

The issue is the following: when the app is connected to the local NATS instance, everything works as expected: no error messages are printed when there are no messages in the stream, just a bunch of lines like

... INFO batch processed count=0

The stream is created via nats stream add test --subjects test --defaults, and the app is run with

go run main.go localhost test 5s

But when the app connects to a distant NATS server, the batch.Error() function returns the "nats: timeout" error. E.g. against demo.nats.io and the stream subscription-updates:

go run main.go demo.nats.io subscription-updates 5s
... WARN failed to fetch batch err="nats: timeout" count=0
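For reference, the demo app boils down to a FetchBatch loop along these lines. This is a minimal sketch assuming the legacy nats.go JetStream API; the URL, stream subject and durable name "demo" are placeholders, and the real code is in the repository linked above:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; the real app takes the server, subject and interval as arguments.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Pull consumer on the "test" stream subject; "demo" is a placeholder durable name.
	sub, err := js.PullSubscribe("test", "demo")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// FetchBatch returns right away; messages and the final error arrive asynchronously.
		batch, err := sub.FetchBatch(10, nats.MaxWait(5*time.Second))
		if err != nil {
			log.Fatal(err)
		}
		count := 0
		for range batch.Messages() {
			count++
		}
		// Locally this is nil when the request simply expires with no messages;
		// over a high-RTT link it comes back as "nats: timeout".
		if err := batch.Error(); err != nil {
			log.Printf("WARN failed to fetch batch err=%q count=%d", err, count)
			continue
		}
		log.Printf("INFO batch processed count=%d", count)
	}
}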
Noteworthy, this error is also returned when I connect to the SAME LOCAL NATS server, but over a connection with an increased RTT / more TCP hops than the local one. I've done that with the help of my VPS, using SSH port forwarding to the server and back (ssh MY-VPS -L 127.0.0.1:4220:127.0.0.1:4222 -R 127.0.0.1:4222:127.0.0.1:4222 -N):

go run main.go localhost:4220 test 5s (port 4220 goes to the VPS and back to the local NATS on port 4222)

then

... WARN failed to fetch batch err="nats: timeout" count=0

The nats.go version is github.com/nats-io/nats.go v1.36.0.
I've debugged the traffic, and normally:

1. the client sends a ...CONSUMER.MSG.NEXT.. request
2. the server sends a 408 timeout response
3. steps 1-2 are repeated

but with increased RTT:

1. the client sends a ...CONSUMER.MSG.NEXT.. request
2. the client then sends the next ...CONSUMER.MSG.NEXT.. request
3. the server sends the 408 timeout response to the 1st request
4. the client receives the server's response but no longer "knows" its reqID (https://github.com/nats-io/nats.go/blob/97e6a525e9d861338e89d31e81063a726a718941/js.go#L3226), as it has already sent a new request, so it fails with the error
5. steps 2-4 repeat

A rough sketch of what a single such request/response exchange looks like on the wire is shown below.
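The sketch (the subject, stream name "test" and consumer name "cons" are illustrative placeholders; this is not how nats.go issues the request internally, where replies are multiplexed and matched by request ID):

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Each pull request is tied to its own reply inbox; a 408 that shows up after
	// the client has already moved on to the next request is no longer recognized.
	inbox := nats.NewInbox()
	sub, err := nc.SubscribeSync(inbox)
	if err != nil {
		log.Fatal(err)
	}

	// "expires" is in nanoseconds (5s here), batch size 10.
	req := []byte(`{"batch":10,"expires":5000000000}`)
	if err := nc.PublishRequest("$JS.API.CONSUMER.MSG.NEXT.test.cons", inbox, req); err != nil {
		log.Fatal(err)
	}

	// With no pending messages the server answers after ~5s (plus one-way latency)
	// with a status-only message whose header line reads "NATS/1.0 408 Request Timeout".
	msg, err := sub.NextMsg(6 * time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(msg.Header)
}

The failing case above is simply this exchange with the 408 arriving one RTT too late, after the client has already issued the next request.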
Also, I've cloned the nats.go library (tag v1.36.0) and patched the tests like this:

diff --git a/test/kv_test.go b/test/kv_test.go
index a1b5ca4..2b62b28 100644
--- a/test/kv_test.go
+++ b/test/kv_test.go
@@ -999,7 +999,7 @@ func TestKeyValueDuplicatesWindow(t *testing.T) {
 func client(t *testing.T, s *server.Server, opts ...nats.Option) *nats.Conn {
 	t.Helper()
-	nc, err := nats.Connect(s.ClientURL(), opts...)
+	nc, err := nats.Connect("nats://localhost:4222", opts...)
 	if err != nil {
 		t.Fatalf("Unexpected error: %v", err)
 	}

Run against the local server:

go test ./test -run FetchBatch

they expectedly succeed:

ok      github.com/nats-io/nats.go/test 1.576s
But when I use NATS with increased RTT (change the port to 4220 as above):

diff --git a/test/kv_test.go b/test/kv_test.go
index a1b5ca4..2b62b28 100644
--- a/test/kv_test.go
+++ b/test/kv_test.go
@@ -999,7 +999,7 @@ func TestKeyValueDuplicatesWindow(t *testing.T) {
 func client(t *testing.T, s *server.Server, opts ...nats.Option) *nats.Conn {
 	t.Helper()
-	nc, err := nats.Connect(s.ClientURL(), opts...)
+	nc, err := nats.Connect("nats://localhost:4220", opts...)
 	if err != nil {
 		t.Fatalf("Unexpected error: %v", err)
 	}

go test ./test -run FetchBatch

they fail:

--- FAIL: TestPullSubscribeFetchBatchWithHeartbeat (1.49s)
    js_test.go:1358: Unexpected error: nats: no heartbeat received
--- FAIL: TestPullSubscribeFetchBatch (39.70s)
    --- FAIL: TestPullSubscribeFetchBatch/multiple_concurrent_fetches (7.92s)
        js_test.go:1547: Timeout waiting for incoming messages
    --- FAIL: TestPullSubscribeFetchBatch/context_timeout,_no_error (1.38s)
        js_test.go:1689: Unexpected error: context deadline exceeded
    --- FAIL: TestPullSubscribeFetchBatch/request_expired (0.91s)
        js_test.go:1716: Unexpected error: nats: timeout
    --- FAIL: TestPullSubscribeFetchBatch/remove_durable_consumer_during_fetch (11.07s)
        js_test.go:1779: Expected error: nats: consumer deleted; got: %!s(<nil>)
    --- FAIL: TestPullSubscribeFetchBatch/close_subscription (11.16s)
        js_test.go:1837: Unexpected error: nats: consumer not found
FAIL
FAIL    github.com/nats-io/nats.go/test 41.194s
FAIL
I guess this is a bug in FetchBatch? I guess tests must not fail depending on the RTT to the server.
-
What are your default max wait and timeout options?
-
Hello @ashumkin. The timeout you're encountering is introduced as a safeguard in case we don't get the timeout response from the server. It has to be larger than the expiry set on the pull request, but the value needs to be somewhat arbitrary since we can't really speculate on the RTT. That being said, I see that the added time is way too small now (10ms), giving a high probability of hitting the internal (client) timeout rather than waiting for the server timeout. I'll increase the value so that it should work fine under most circumstances, but you may still get timeout errors on unusually high RTTs.
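To make that concrete, here is a back-of-the-envelope timing model of the race (illustrative only; the 10ms figure is the buffer mentioned above, and the RTT value is an arbitrary example):

package main

import (
	"fmt"
	"time"
)

func main() {
	expiry := 5 * time.Second       // expiry sent with the pull request
	buffer := 10 * time.Millisecond // client-side safeguard added on top of the expiry
	rtt := 40 * time.Millisecond    // round trip to the server (arbitrary example)

	clientGivesUpAt := expiry + buffer // when the internal client timeout fires
	serverReplyArrives := expiry + rtt // when the server's 408 actually gets back to the client

	if serverReplyArrives > clientGivesUpAt {
		fmt.Println(`client timeout fires first -> batch.Error() reports "nats: timeout"`)
	} else {
		fmt.Println("the 408 arrives in time -> the fetch ends cleanly with no messages")
	}
}

With a 10ms buffer, any RTT above roughly 10ms tips the comparison the wrong way, which matches the behavior seen over the SSH-forwarded connection.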