switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test #715
Conversation
@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI, this is a first pass at replacing the tkn cli with generic k8s get-pod-logs API calls. I could see this replacing #712 after a little more refining, but I'm curious on opinions / feedback. I'm also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow-up pull request that also dives into verifying UpdateLogs completed in the API server, to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. I took a first pass at that today, and it is going to require some unit test overhaul, as I've discovered the unit tests are not fully wired to return fake data.
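(For reference, a minimal sketch of the kind of generic pod-log retrieval described above, using client-go's pod log subresource directly; the helper name and wiring here are illustrative assumptions, not this PR's actual code.)

```go
package main

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// streamContainerLogs copies the logs of one container to w using the plain
// Kubernetes pod-log subresource, with no tkn cli dependency.
func streamContainerLogs(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod, container string, w io.Writer) error {
	req := clientset.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: container})
	rc, err := req.Stream(ctx)
	if err != nil {
		return fmt.Errorf("opening log stream for pod %s container %s: %w", pod.Name, container, err)
	}
	defer rc.Close()
	_, err = io.Copy(w, rc)
	return err
}
```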
if podLogsErr != nil {
	return podLogsErr
}
hdr := fmt.Sprintf("*** Logs for pod %s container %s ***\n", pod.Name, container.Name)
Should we keep the format the same as the tkn logs?
https://github.com/tektoncd/cli/blob/0369f038b9fa32fe746eb1b32db6345822a9dc5c/pkg/log/writer.go#L56-L61
+1
As long as we merge #712 first so we can get some immediate relief with respect to the memory leak, then yes, I'll look more into the "Rainbow" decorator (or whatever it was that I saw while investigating things) and see about keeping that format.
Also, I want to get a baseline in our prod env (vs. our separate stress tests in test envs) with the current multi-threaded format before moving on to this PR's further reduction in threads.
I did not notice that the colouring package is used. Since we are storing the logs I don't think we should do the "rainbow" colours, but the format with [task-name] and [step-name] is what I meant.
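(For illustration, a minimal sketch of that [task-name] / [step-name] style line prefixing without the colour decoration; the function and parameter names are assumptions, not tkn's actual writer or this PR's code.)

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// prefixLines writes each line read from r to w, prefixed with
// "[taskName : stepName] " in the spirit of the tkn log writer, but with no
// colour decoration since the logs are being stored rather than displayed.
func prefixLines(r io.Reader, w io.Writer, taskName, stepName string) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		if _, err := fmt.Fprintf(w, "[%s : %s] %s\n", taskName, stepName, scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}
```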
The switch to the old format is up and has passed the e2e's (though a second e2e run is in progress as I type, since I've removed the debug and squashed related commits).
PTAL @sayan-biswas @avinal
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sayan-biswas
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Looks good to me. I ran it locally and the logs were stored perfectly, although I didn't check with bulk pipeline runs and larger logs. I just have one comment on the log format.
@sayan-biswas is your question here ^^ separate from the one in #715 (comment)? If it is, apologies, I'm missing what you are getting at, and I ask that you elaborate either here or in slack / team meetings. Thanks.
@gabemontero No, the one in the comment is what I was referring to. Everything is working for me. I did not test the memory footprint though.
OK, thanks for confirming the comments refer to the same core notion @sayan-biswas. As to the memory footprint, I am waiting to hear back from Pavel on when he can start a soak test, so not yet. This is a factor in why I want to merge #712 first: we have validated the memory footprint is fixed there. Also, I want to see how that change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.
OK, success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496. This PR now:
Side bar: @khrm I'll be curious how your viper POC mirrors / overlaps with the grpc machinations with the results api server. I'll work out with @pmacik on getting the next soak test run started, but ideally, in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment. Thanks
Good news @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm: the soak test completed over the weekend and both
Let's please go ahead and review and get this PR ready for merge. Aside from fixing the intermittent canceled context errors, this change also allows us to better retry on retryable errors. Thanks
/hold cancel
Updated the PR description, fixed issue, and release note to address the latest findings.
// comparing closeErr with io.EOF does not work; and I could not find code / desc etc. constants in the grpc code that handled
// the wrapped EOF error we expect to get from grpc when things are "OK"
if logSummary, closeErr := logsClient.CloseAndRecv(); closeErr != nil && !strings.Contains(closeErr.Error(), "EOF") {
Does this wait for the streaming to complete? If so, then shouldn't this hold the main reconciler loop?
Yes and yes. All my recent work has proven beyond a doubt that spawning threads off of the reconciler thread for these grpc calls is absolutely untenable; we absolutely cannot go back to that mode of operation moving forward:
- goroutine / mem leaks because contexts are not properly cleaned up
- intermittent UpdateLog calls getting aborted because of canceled contexts
- and, as you know since you've reported it to me previously, we can't do proper retry on retryable errors with it

My commit to bump the worker thread pool to 32 when logging is enabled is in part a means to offset this, in case users inadvertently try stress tests with this. That said, prior soak tests from Pavel when the threadiness was still 2 vs. 32 showed it "performs OK".
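(As a rough sketch of the point about keeping the streaming on the reconciler thread: returning the error from the reconcile call is what lets the workqueue retry retryable failures. The names below, such as reconcileLogs and isRetryable, are hypothetical stand-ins, not the controller's actual functions.)

```go
package reconciler

import (
	"context"
	"log"
)

// streamFn stands in for the synchronous log-streaming call made on the
// reconciler thread; isRetryable stands in for whatever error classification
// the controller uses.
type streamFn func(ctx context.Context) error

// reconcileLogs runs the stream synchronously. Returning a non-nil error
// hands the key back to the workqueue, which re-enqueues it with backoff;
// that is what enables retry on retryable UpdateLog errors, unlike the
// previous fire-and-forget goroutine.
func reconcileLogs(ctx context.Context, stream streamFn, isRetryable func(error) bool) error {
	if err := stream(ctx); err != nil {
		if isRetryable(err) {
			return err // workqueue will retry with backoff
		}
		// Non-retryable: record it and move on rather than blocking the queue.
		log.Printf("giving up on log streaming: %v", err)
	}
	return nil
}
```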
So I'm fine losing that commit and leaving the default worker thread count at 2 if there is concern about 32 threads running with kind on one's laptop. I seriously doubt that is the case given the pprof data Pavel has accumulated so far, but the other issues were more of the focus.
fix panic, add debug, to e2e_gcs_test
rm tkn cli dependency
go mod tidy, vendor
add rbac for pod listing
move streamLogs onto reconciler thread
mimic tkn client style prefixing on each line of log

These changes address a goroutine/memory leak that existed with the multi threaded tkn client version, as well as now allow us to reconcile on retryable errors

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
…e goroutine cleanup by not interrupting the UpdateLog call before it finishes

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
/hold
Still waiting to see in our prod env whether this switch is necessary, based on how the current tkn based solution performs.
@gabemontero: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Changes
Fixes #720
Alternative to #712 that still addresses the goroutine/memory leak by switching from the tkn client to the k8s client, but, with the reduction of threads, also allows for retry, when appropriate, on UpdateLog errors.
/kind bug
These changes address a goroutine/memory leak that existed with the prior main branch version.
Additionally
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes