fetcher autonomously panics #30

Closed

LeslieLeung opened this issue Jun 25, 2024 · 11 comments

@LeslieLeung

I am using github.com/lestrrat-go/jwx/v2 v2.1.0 and github.com/lestrrat-go/httprc v1.0.5, and the fetcher randomly panics with the following trace:

panic: send on closed channel

goroutine 194 [running]:
github.com/lestrrat-go/httprc.(*fetchResult).reply(0xc000ba8b70, {0x2104a00, 0x3007980}, 0xc004ba5b60)
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:37 +0x86
github.com/lestrrat-go/httprc.runFetchWorker({0x2104a00, 0x3007980}, 0xc0000bbbc0, {0x0, 0x0})
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:177 +0x2ac
created by github.com/lestrrat-go/httprc.NewFetcher in goroutine 51
/go/pkg/mod/github.com/lestrrat-go/[email protected]/fetcher.go:75 +0x1bc

Given that the goroutine is launched inside httprc, I don't have any means to recover from this panic.

Here is the code on the jwx side if it helps:

jwx code (package clause and imports added for context; ApiGetPublicKey and log are the application's own constant and context-aware logger, whose definitions are not shown):

package auth // package name here is illustrative

import (
	"context"
	"time"

	"github.com/lestrrat-go/jwx/v2/jwk"
	"github.com/lestrrat-go/jwx/v2/jwt"
)

type Verifier struct {
	self       string
	pkEndpoint string

	c  *jwk.Cache
	cs jwk.Set
}

func NewVerifier(host, self string) (*Verifier, error) {
	ctx := context.Background()
	pkEndpoint := host + ApiGetPublicKey
	c := jwk.NewCache(ctx)
	err := c.Register(pkEndpoint, jwk.WithMinRefreshInterval(5*time.Minute))
	if err != nil {
		log.Errorf(ctx, "Error registering public key endpoint: %v", err)
		return nil, err
	}
	_, err = c.Refresh(ctx, pkEndpoint)
	if err != nil {
		log.Errorf(ctx, "Error fetching public key: %v", err)
		return nil, err
	}
	cached := jwk.NewCachedSet(c, pkEndpoint)
	v := &Verifier{
		self:       self,
		pkEndpoint: pkEndpoint,
		c:          c,
		cs:         cached,
	}

	return v, nil
}

func (v *Verifier) Verify(token string, maxRetry int) (bool, error) {
	ctx := context.Background()
	for i := 0; i < maxRetry; i++ {
		// allow refreshing public keys if verification fails
		_, err := jwt.Parse([]byte(token), jwt.WithKeySet(v.cs))
		if err == nil {
			return true, nil
		}

		log.Errorf(ctx, "Error verifying token: %v", err)
		_, err = v.c.Refresh(ctx, v.pkEndpoint)
		if err != nil {
			log.Errorf(ctx, "Error refreshing public key: %v", err)
			return false, err
		}
	}

	return false, nil
}
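
For completeness, the intended call pattern looks roughly like this (a minimal sketch; the host, self, and token values are placeholders, and in practice the Verifier is constructed once and reused rather than built per request):

func tokenOK(tokenString string) bool {
	v, err := NewVerifier("https://auth.example.internal", "my-service")
	if err != nil {
		return false // endpoint registration or the initial key fetch failed
	}
	ok, err := v.Verify(tokenString, 2) // up to two attempts, refreshing keys in between
	return err == nil && ok
}
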
@milescrabill

For what it's worth, we noticed this after upgrading from jwx v2.0.21 to v2.1.0, which upgrades httprc from v1.0.4 to v1.0.5.

The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

@TheJokr

TheJokr commented Jul 10, 2024

I see similar panics with httprc 1.0.5, stemming from code that, like LeslieLeung's, uses jwx 2.1.0.

The only suspicious change (to me) there is the change in where the queue mutex is locked in fetchAndStore here.

I think so too: the mutex isn't held during the q.fetch.fetch(...) call anymore. This probably allows a foreground Refresh(ctx) and the background fetcher to call q.fetch.fetch(...) concurrently with the same *fetchRequest pointer, leading to the channel being overwritten and closed twice.

@milescrabill

So I realized that my statement above was not quite correct: jwx v2.0.21 is where the httprc upgrade from v1.0.4 to v1.0.5 happened.

@natenjoy I see that you created the PR that made this mutex change, have you observed these issues at all?
@lestrrat any thoughts on this issue?

@natenjoy

I don't see it in our logs for the past 3 months. Are you able to reproduce this issue somewhat regularly? If so, could you set GOTRACEBACK=all, rerun the service, and look for likely causes in other goroutines? Looking at httprc 1.0.5 and the line numbers from the stack trace, the code in the trace appears to be receiving from a channel, not sending on a closed one, yet sending on a closed channel is what is panicking.
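
For reference, the same level of traceback detail can also be requested from inside the process rather than via the environment. A minimal sketch (runtime/debug.SetTraceback can raise, but not lower, the level given by GOTRACEBACK):

package main

import "runtime/debug"

func main() {
	// Same effect as GOTRACEBACK=all: print every goroutine, not just the
	// crashing one, when an unrecovered panic terminates the program.
	debug.SetTraceback("all")

	// ... start the service as usual ...
}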

@TheJokr

TheJokr commented Jul 11, 2024

the code in the trace appears to be trying to receive from a channel and not send on a closed channel.

The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:. This is where the panic originates; I guess Go's tracebacks just don't indicate which case caused the panic.
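
As a tiny standalone illustration (not httprc's code): a send case inside a select panics exactly like a bare send once its channel has been closed, and the panic message alone doesn't say which case fired.

package main

func main() {
	reply := make(chan int, 1)
	close(reply)

	done := make(chan struct{})
	select {
	case reply <- 42: // proceeds on the closed channel and panics: "send on closed channel"
	case <-done: // never ready; the receive case is not the one that fails
	}
}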

As I mentioned above, I'm pretty sure the reason the channel is closed is that two goroutines are executing (*fetcher).fetch(...) with the same *fetchRequest in parallel. While each fetch() does create a new reply channel, the channel is written into the same *fetchRequest pointer. This pointer is then added to f.requests here, once by each call. After the first of those two entries in f.requests is done, the channel is closed in reply(...). This leaves the second entry with a closed channel from the start, ultimately causing the panic.
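
To make that sequence concrete, here is a stripped-down sketch with hypothetical names (not httprc's actual code, just the same shape: one shared request object enqueued twice, its reply channel replaced by the second caller and closed after the first queue entry is served):

package main

type fetchRequest struct {
	reply chan string // replaced by every fetch() call on the same pointer
}

// fetch stands in for (*fetcher).fetch: attach a fresh reply channel to the
// request, then enqueue the request for the worker.
func fetch(queue chan<- *fetchRequest, req *fetchRequest) {
	req.reply = make(chan string, 1) // the second call overwrites the first channel
	queue <- req
}

// worker stands in for runFetchWorker: reply to each entry, then close its channel.
func worker(queue <-chan *fetchRequest) {
	for req := range queue {
		req.reply <- "result" // second entry: the channel is already closed -> panic
		close(req.reply)
	}
}

func main() {
	queue := make(chan *fetchRequest, 2)
	shared := &fetchRequest{}

	// In httprc these two calls race (foreground Refresh vs. the background
	// refresher); calling them back to back is enough to show the effect.
	fetch(queue, shared)
	fetch(queue, shared)
	close(queue)

	worker(queue) // panics: send on closed channel
}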

@natenjoy

The line number indicated at the top of the traceback is a select block, which does contain a send case: case reply <- fr:

Correct, I misread that second case. I opened #31 as a revert commit of the prior change.

@lestrrat
Collaborator

Sorry, I completely missed the original report two weeks ago. I'm a bit busy today; please give me a day or two to sort things out.

@lestrrat
Collaborator

Okay. I understand the analysis, and I will take everybody's word for it, but the truth is that I can't reproduce it, so tbh I feel a bit uneasy.

I'll merge #31 and whatnot, but if anybody can come up with a reliable reproduction test case in the future, I think it will benefit all of us.

@lestrrat
Collaborator

Merged #31, so I'll close this for now. Please open a new issue if the problem persists.

@TheJokr

TheJokr commented Jul 13, 2024

Thanks for releasing the fix @lestrrat! I built a small reproduction example as a gist here, in case you are interested: https://gist.github.com/TheJokr/d5b836cca484d4a00967504c553987cf

It panics consistently with v1.0.5, but I had to rely on timing to make sure the prerequisites for the bug are met. A debugger with breakpoints would likely mess the timing up, so access to httprc's internals would be necessary for proper debugging. The panic does not occur with v1.0.6 👍
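
For readers who don't open the gist, the general shape of such a timing-based reproduction looks roughly like the following (a sketch in the spirit of the gist, not a copy of it; the server delay, refresh interval, empty JWKS body, and goroutine count are arbitrary, and the timing makes the panic likely rather than guaranteed). Against httprc v1.0.5 it should eventually crash with the same "send on closed channel" trace; against v1.0.6 it should run until the context times out.

package main

import (
	"context"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"

	"github.com/lestrrat-go/jwx/v2/jwk"
)

func main() {
	// Slow responses make it more likely that a foreground Refresh overlaps
	// with the background fetcher working on the same URL.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		time.Sleep(50 * time.Millisecond)
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"keys":[]}`)) // minimal valid (empty) JWKS
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	c := jwk.NewCache(ctx)
	if err := c.Register(srv.URL, jwk.WithRefreshInterval(time.Second)); err != nil {
		panic(err)
	}

	// Hammer foreground refreshes so they race the scheduled background fetch.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				c.Refresh(ctx, srv.URL) // errors ignored; only the panic matters here
			}
		}()
	}
	wg.Wait()
}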

lestrrat added a commit that referenced this issue Jul 13, 2024
* Add test

I have run this against the version that panics, and verified that
it indeed reproduces #30. Future changes can probably rely on
this test for _preliminary_ checks against the same symptom.

* Appease linter
@lestrrat
Collaborator

@TheJokr Thanks, merged as #32
