fetcher autonomously panics #30

Closed

LeslieLeung opened this issue Jun 25, 2024 · 11 comments

@LeslieLeung

I am using github.com/lestrrat-go/jwx/v2 v2.1.0 and github.com/lestrrat-go/httprc v1.0.5, and the fetcher randomly panics with the following trace:

panic: send on closed channel

goroutine 194 [running]:
github.com/lestrrat-go/httprc.(*fetchResult).reply(0xc000ba8b70, {0x2104a00, 0x3007980}, 0xc004ba5b60)
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:37 +0x86
github.com/lestrrat-go/httprc.runFetchWorker({0x2104a00, 0x3007980}, 0xc0000bbbc0, {0x0, 0x0})
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:177 +0x2ac
created by github.com/lestrrat-go/httprc.NewFetcher in goroutine 51
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:75 +0x1bc

Given that the goroutine is launched inside httprc, I don't have any means to recover from this panic.

Here is the code on the jwx side if it helps:

jwx code (package clause and imports added for context; ApiGetPublicKey and log are the application's own constant and context-aware logger, whose definitions are not shown):

package auth // package name here is illustrative

import (
	"context"
	"time"

	"github.com/lestrrat-go/jwx/v2/jwk"
	"github.com/lestrrat-go/jwx/v2/jwt"
)

type Verifier struct {
	self       string
	pkEndpoint string

	c  *jwk.Cache
	cs jwk.Set
}

func NewVerifier(host, self string) (*Verifier, error) {
	ctx := context.Background()
	pkEndpoint := host + ApiGetPublicKey
	c := jwk.NewCache(ctx)
	err := c.Register(pkEndpoint, jwk.WithMinRefreshInterval(5*time.Minute))
	if err != nil {
		log.Errorf(ctx, "Error registering public key endpoint: %v", err)
		return nil, err
	}
	_, err = c.Refresh(ctx, pkEndpoint)
	if err != nil {
		log.Errorf(ctx, "Error fetching public key: %v", err)
		return nil, err
	}
	cached := jwk.NewCachedSet(c, pkEndpoint)
	v := &Verifier{
		self:       self,
		pkEndpoint: pkEndpoint,
		c:          c,
		cs:         cached,
	}

	return v, nil
}

func (v *Verifier) Verify(token string, maxRetry int) (bool, error) {
	ctx := context.Background()
	for i := 0; i < maxRetry; i++ {
		// allow refreshing public keys if verification fails
		_, err := jwt.Parse([]byte(token), jwt.WithKeySet(v.cs))
		if err == nil {
			return true, nil
		}

		log.Errorf(ctx, "Error verifying token: %v", err)
		_, err = v.c.Refresh(ctx, v.pkEndpoint)
		if err != nil {
			log.Errorf(ctx, "Error refreshing public key: %v", err)
			return false, err
		}
	}

	return false, nil
}
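
For completeness, the intended call pattern looks roughly like this (a minimal sketch; the host, self, and token values are placeholders, and in practice the Verifier is constructed once and reused rather than built per request):

func tokenOK(tokenString string) bool {
	v, err := NewVerifier("https://auth.example.internal", "my-service")
	if err != nil {
		return false // endpoint registration or the initial key fetch failed
	}
	ok, err := v.Verify(tokenString, 2) // up to two attempts, refreshing keys in between
	return err == nil && ok
}
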
@milescrabill

For what it's worth, we noticed this after upgrading from jwx v2.0.21 to v2.1.0, which upgrades httprc from v1.0.4 to v1.0.5.

The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

@TheJokr

TheJokr commented Jul 10, 2024

I see similar panics with httprc 1.0.5, stemming from code that, like LeslieLeung's, uses jwx 2.1.0.

The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

I think so too: the mutex isn't held during the q.fetch.fetch(...) call anymore. This probably allows a foreground Refresh(ctx) and the background fetcher to call q.fetch.fetch(...) concurrently with the same *fetchRequest pointer, leading to the channel being overwritten and closed twice.

@milescrabill

So I realized that my statement above was not quite correct: jwx v2.0.21 is where the httprc upgrade from v1.0.4 to v1.0.5 happened.

@natenjoy I see that you created the PR that made this mutex change, have you observed these issues at all?
@lestrrat any thoughts on this issue?

@natenjoy

I don't see it in our logs for the past 3 months. Are you able to reproduce this issue somewhat regularly? If so, could you set GOTRACEBACK=all, rerun the service, and look for likely causes in other goroutines? Looking at httprc 1.0.5 and the line numbers from the stack trace, the code in the trace appears to be receiving from a channel, not sending on a closed one, yet sending on a closed channel is what is panicking.
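
For reference, the same level of traceback detail can also be requested from inside the process rather than via the environment. A minimal sketch (runtime/debug.SetTraceback can raise, but not lower, the level given by GOTRACEBACK):

package main

import "runtime/debug"

func main() {
	// Same effect as GOTRACEBACK=all: print every goroutine, not just the
	// crashing one, when an unrecovered panic terminates the program.
	debug.SetTraceback("all")

	// ... start the service as usual ...
}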

@TheJokr

TheJokr commented Jul 11, 2024

the code in the trace appears to be trying to receive from a channel and not send on a closed channel.

The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:. This is where the panic originates; I guess Go's tracebacks just don't indicate which case caused the panic.
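
As a tiny standalone illustration (not httprc's code): a send case inside a select panics exactly like a bare send once its channel has been closed, and the panic message alone doesn't say which case fired.

package main

func main() {
	reply := make(chan int, 1)
	close(reply)

	done := make(chan struct{})
	select {
	case reply <- 42: // proceeds on the closed channel and panics: "send on closed channel"
	case <-done: // never ready; the receive case is not the one that fails
	}
}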

As I mentioned above, I'm pretty sure the reason the channel is closed is that two goroutines are executing (*fetcher).fetch(...) with the same *fetchRequest in parallel. While each fetch() does create a new reply channel, the channel is written into the same *fetchRequest pointer. This pointer is then added to f.requests here, once by each call. After the first of those two entries in f.requests is done, the channel is closed in reply(...). This leaves the second entry with a closed channel from the start, ultimately causing the panic.
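
To make that sequence concrete, here is a stripped-down sketch with hypothetical names (not httprc's actual code, just the same shape: one shared request object enqueued twice, its reply channel replaced by the second caller and closed after the first queue entry is served):

package main

type fetchRequest struct {
	reply chan string // replaced by every fetch() call on the same pointer
}

// fetch stands in for (*fetcher).fetch: attach a fresh reply channel to the
// request, then enqueue the request for the worker.
func fetch(queue chan<- *fetchRequest, req *fetchRequest) {
	req.reply = make(chan string, 1) // the second call overwrites the first channel
	queue <- req
}

// worker stands in for runFetchWorker: reply to each entry, then close its channel.
func worker(queue <-chan *fetchRequest) {
	for req := range queue {
		req.reply <- "result" // second entry: the channel is already closed -> panic
		close(req.reply)
	}
}

func main() {
	queue := make(chan *fetchRequest, 2)
	shared := &fetchRequest{}

	// In httprc these two calls race (foreground Refresh vs. the background
	// refresher); calling them back to back is enough to show the effect.
	fetch(queue, shared)
	fetch(queue, shared)
	close(queue)

	worker(queue) // panics: send on closed channel
}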

@natenjoy

The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:

Correct, I misread that second case. I opened #31 as a revert commit of the prior change.

@lestrrat
Collaborator

Sorry, I completely missed the original report two weeks ago. I'm a bit busy today; please give me a day or two to sort things out.

@lestrrat
Collaborator

Okay. I understand the analysis, and I will take everybody's word for it, but the truth is that I can't reproduce it, so tbh I feel a bit uneasy.

I'll merge #31 and whatnot, but if anybody can come up with a reliable reproduction test case in the future, I think it will benefit all of us.

@lestrrat
Collaborator

Merged #31, so I'll close this for now. Please open a new issue if the problem persists.

@TheJokr

TheJokr commented Jul 13, 2024

Thanks for releasing the fix @lestrrat! I built a small reproduction example as a gist here, in case you are interested: https://gist.github.com/TheJokr/d5b836cca484d4a00967504c553987cf

It panics consistently with v1.0.5, but I had to rely on timing to make sure the prerequisites for the bug are met. A debugger with breakpoints would likely mess the timing up, so access to httprc's internals would be necessary for proper debugging. The panic does not occur with v1.0.6 👍
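
For readers who don't open the gist, the general shape of such a timing-based reproduction looks roughly like the following (a sketch in the spirit of the gist, not a copy of it; the server delay, refresh interval, empty JWKS body, and goroutine count are arbitrary, and the timing makes the panic likely rather than guaranteed). Against httprc v1.0.5 it should eventually crash with the same "send on closed channel" trace; against v1.0.6 it should run until the context times out.

package main

import (
	"context"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"

	"github.com/lestrrat-go/jwx/v2/jwk"
)

func main() {
	// Slow responses make it more likely that a foreground Refresh overlaps
	// with the background fetcher working on the same URL.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(50 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"keys":[]}`)) // minimal valid (empty) JWKS
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	c := jwk.NewCache(ctx)
	if err := c.Register(srv.URL, jwk.WithRefreshInterval(time.Second)); err != nil {
		panic(err)
	}

	// Hammer foreground refreshes so they race the scheduled background fetch.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				c.Refresh(ctx, srv.URL) // errors ignored; only the panic matters here
			}
		}()
	}
	wg.Wait()
}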

lestrrat added a commit that referenced this issue Jul 13, 2024
* Add test

I have run this against the version that panics, and verified that
it indeed reproduces #30. Future changes can probably rely on
this test for _preliminary_ checks against the same symptom.

* Appease linter
@lestrrat
Collaborator

@TheJokr Thanks, merged as #32
