switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test #715
Conversation
@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI, this is a first pass at replacing the tkn cli with generic k8s get-pod-logs API calls. I could see this replacing #712 after a little more refining, but I'm curious on opinions / feedback. I'm also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow-up pull request that also dives into verifying UpdateLogs completed in the API server, to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. I took a first pass at that today, and it is going to require some unit test overhaul, as I've discovered the unit tests are not fully wired to return fake data.
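(For reference, a minimal sketch of the kind of generic pod-log retrieval described above, using client-go's pod log subresource directly; the helper name and wiring here are illustrative assumptions, not this PR's actual code.)

```go
package main

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// streamContainerLogs copies the logs of one container to w using the plain
// Kubernetes pod-log subresource, with no tkn cli dependency.
func streamContainerLogs(ctx context.Context, clientset kubernetes.Interface, pod *corev1.Pod, container string, w io.Writer) error {
	req := clientset.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: container})
	rc, err := req.Stream(ctx)
	if err != nil {
		return fmt.Errorf("opening log stream for pod %s container %s: %w", pod.Name, container, err)
	}
	defer rc.Close()
	_, err = io.Copy(w, rc)
	return err
}
```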
if podLogsErr != nil {
	return podLogsErr
}
hdr := fmt.Sprintf("*** Logs for pod %s container %s ***\n", pod.Name, container.Name)
Should we keep the format the same as the tkn logs?
https://github.com/tektoncd/cli/blob/0369f038b9fa32fe746eb1b32db6345822a9dc5c/pkg/log/writer.go#L56-L61
+1
As long as we merge #712 first so we can get some immediate relief with respect to the memory leak, then yes, I'll look more into the "Rainbow" decorator (or whatever it was that I saw while investigating things) and see about keeping that format.
Also, I want to get a baseline in our prod env (vs. our separate stress tests in test envs) with the current multi-threaded format before moving on to this PR's further reduction in threads.
I did not notice that the colouring package is used. Since we are storing the logs I don't think we should do the "rainbow" colours, but the format with [task-name] and [step-name] is what I meant.
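(For illustration, a minimal sketch of that [task-name] / [step-name] style line prefixing without the colour decoration; the function and parameter names are assumptions, not tkn's actual writer or this PR's code.)

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// prefixLines writes each line read from r to w, prefixed with
// "[taskName : stepName] " in the spirit of the tkn log writer, but with no
// colour decoration since the logs are being stored rather than displayed.
func prefixLines(r io.Reader, w io.Writer, taskName, stepName string) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		if _, err := fmt.Fprintf(w, "[%s : %s] %s\n", taskName, stepName, scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}
```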
The switch to the old format is up and has passed the e2e's (though a second e2e run is in progress as I type, since I've removed the debug and squashed related commits).
PTAL @sayan-biswas @avinal
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sayan-biswas
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Looks good to me. I ran it locally and the logs were stored perfectly, although I didn't check with bulk pipeline runs and larger logs. I just have one comment on the log format.
@sayan-biswas is your question here ^^ separate from the one in #715 (comment)? If it is, apologies, I'm missing what you are getting at, and I ask that you elaborate either here or in slack / team meetings. Thanks.
@gabemontero No, the one in the comment is what I was referring to. Everything is working for me. I did not test the memory footprint though.
OK, thanks for confirming the comments refer to the same core notion @sayan-biswas. As to the memory footprint, I am waiting to hear back from Pavel on when he can start a soak test, so not yet. This is a factor in why I want to merge #712 first: we have validated the memory footprint is fixed there. Also, I want to see how that change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.
OK, success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496. This PR now:
Side bar: @khrm I'll be curious how your viper POC mirrors / overlaps with the grpc machinations with the results api server. I'll work out with @pmacik on getting the next soak test run started, but ideally, in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment. Thanks
Good news @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm: the soak test completed over the weekend and both
Let's please go ahead and review and get this PR ready for merge. Aside from fixing the intermittent canceled context errors, this change also allows us to better retry on retryable errors. Thanks
/hold cancel
Updated the PR description, fixed issue, and release note to address the latest findings.
// comparing closeErr with io.EOF does not work; and I could not find code / desc etc. constants in the grpc code that handled
// the wrapped EOF error we expect to get from grpc when things are "OK"
if logSummary, closeErr := logsClient.CloseAndRecv(); closeErr != nil && !strings.Contains(closeErr.Error(), "EOF") {
Does this wait for the streaming to complete? If so, then shouldn't this hold the main reconciler loop?
Yes and yes. All my recent work has proven beyond a doubt that spawning threads off of the reconciler thread for these grpc calls is absolutely untenable; we absolutely cannot go back to that mode of operation moving forward:
- goroutine / mem leaks because contexts are not properly cleaned up
- intermittent UpdateLog calls getting aborted because of canceled contexts
- and, as you know since you've reported it to me previously, we can't do proper retry on retryable errors with it

My commit to bump the worker thread pool to 32 when logging is enabled is in part a means to offset this, in case users inadvertently try stress tests with this. That said, prior soak tests from Pavel when the threadiness was still 2 vs. 32 showed it "performs OK".
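(As a rough sketch of the point about keeping the streaming on the reconciler thread: returning the error from the reconcile call is what lets the workqueue retry retryable failures. The names below, such as reconcileLogs and isRetryable, are hypothetical stand-ins, not the controller's actual functions.)

```go
package reconciler

import (
	"context"
	"log"
)

// streamFn stands in for the synchronous log-streaming call made on the
// reconciler thread; isRetryable stands in for whatever error classification
// the controller uses.
type streamFn func(ctx context.Context) error

// reconcileLogs runs the stream synchronously. Returning a non-nil error
// hands the key back to the workqueue, which re-enqueues it with backoff;
// that is what enables retry on retryable UpdateLog errors, unlike the
// previous fire-and-forget goroutine.
func reconcileLogs(ctx context.Context, stream streamFn, isRetryable func(error) bool) error {
	if err := stream(ctx); err != nil {
		if isRetryable(err) {
			return err // workqueue will retry with backoff
		}
		// Non-retryable: record it and move on rather than blocking the queue.
		log.Printf("giving up on log streaming: %v", err)
	}
	return nil
}
```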
So I'm fine losing that commit and leaving the default worker thread count at 2 if there is concern about 32 threads running with kind on one's laptop. I seriously doubt that is the case given the pprof data Pavel has accumulated so far, but the other issues were more of the focus.
fix panic, add debug, to e2e_gcs_test
rm tkn cli dependency
go mod tidy, vendor
add rbac for pod listing
move streamLogs onto reconciler thread
mimic tkn client style prefixing on each line of log

These changes address a goroutine/memory leak that existed with the multi threaded tkn client version, as well as now allow us to reconcile on retryable errors

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
…e goroutine cleanup by not interrupting the UpdateLog call before it finishes

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
/hold
Still waiting to see in our prod env whether this switch is necessary, based on how the current tkn based solution performs.
@gabemontero: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Changes
Fixes #720
Alternative to #712 that still addresses the goroutine/memory leak by switching from the tkn client to the k8s client, but, with the reduction of threads, also allows for retry, when appropriate, on UpdateLog errors.
/kind bug
These changes address a goroutine/memory leak that existed with the prior main branch version.
Additionally
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes