e2e: Move tests to gh action using azure workers #260
Conversation
The pipeline apparently passed because @ldoktor disabled the status report; in reality it has failed in: …
@ldoktor I suspect the error is at https://github.com/confidential-containers/operator/blob/main/tests/e2e/cluster/up.sh#L60 . The script assumes that the assigned node name (see https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#setting-the-node-name) is equal to `$(hostname)`. Try to print the nodes to check. Ah, there are other places where `$(hostname)` is used.
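Something along these lines could work for looking up the real node name instead of assuming it equals `$(hostname)` (a sketch only; the single-node assumption and the wait condition are illustrative, not the actual fix):

```bash
# Sketch: query the API server for the node name rather than assuming it
# matches $(hostname). Assumes a single-node test cluster.
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

# Example use of the discovered name, e.g. waiting for the node to be Ready.
kubectl wait --for=condition=Ready "node/${NODE_NAME}" --timeout=180s
```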
Force-pushed from d6f27a1 to fd00a61.
.github/workflows/az-e2e.yaml
@@ -0,0 +1,34 @@
name: azure e2e tests
We already have an enclave-cc e2e tests workflow. What this workflow will be testing is the ccruntime implementation, so perhaps we can name it `ccruntime e2e tests` instead of `azure e2e tests`. Or, as we talked about the other day, name it `ccruntime functional tests`. The file should be renamed accordingly too.
Ack, since the Jenkins jobs will be gone it makes sense to avoid specifying az :-)
.github/workflows/az-e2e.yaml
jobs:
  e2e:
    name: operator azure e2e tests
Once #299 is merged, we will run only operator tests (install; install and uninstall, etc.), so I suggest just calling it `operator`.
I see
.github/workflows/az-e2e.yaml
jobs:
  e2e:
    name: operator azure e2e tests
    runs-on: az-ubuntu-2204
I'm about to add an Ubuntu 20.04 runner, so the runner name should be part of the matrix below, i.e. two variations of the same job running on Ubuntu 22.04 and 20.04.
I just added a new runner to serve Ubuntu 20.04. You can use the label `az-ubuntu-2004`. I didn't test that it works though :)
Are they that different? Wouldn't one suffice? Anyway, I'll add that, just asking to save some costs...
@ldoktor good question. Ubuntu 20.04 comes with containerd 1.6; by using it we test a feature of the operator: the installation of containerd 1.7. On Ubuntu 22.04, containerd 1.7 is already installed.
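Roughly, the runner could become a matrix dimension alongside the runtime class, something like this (a sketch using the labels mentioned above; the final workflow may be shaped differently):

```yaml
jobs:
  e2e:
    name: operator tests
    # Run the same job on both Ubuntu releases, using the runner labels above.
    runs-on: ${{ matrix.instance }}
    strategy:
      matrix:
        instance: ["az-ubuntu-2204", "az-ubuntu-2004"]
        runtimeclass: ["kata-qemu", "kata-clh"]
```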
.github/workflows/az-e2e.yaml
    strategy:
      matrix:
        runtimeclass: ["kata-qemu", "kata-clh"]
I'm not sure it makes sense to run the operator tests for each runtimeClass. But let's leave it as is for now.
Perhaps it'd make sense to include the developers. Unless there are very different code paths, we should perhaps just choose one. Do you know whom to ping for that?
.github/workflows/az-e2e.yaml
        runtimeclass: ["kata-qemu", "kata-clh"]

    steps:
      - uses: actions/checkout@v3
v4 is already available.
ack
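i.e. the step simply becomes (assuming nothing else in it changes):

```yaml
      - uses: actions/checkout@v4
```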
.github/workflows/az-e2e.yaml
          export PATH="$PATH:/usr/local/bin"
          ./run-local.sh -r "$RUNTIME_CLASS" -u
        env:
          RUNTIME_CLASS: ${{ matrix.runtimeclass }}
I don't think it should export RUNTIME_CLASS; the -r parameter to run-local.sh should account for that.
It's just a name clash; I wanted to have it available in case we reuse it multiple times. But let me hardcode it.
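Something like this, passing the matrix value straight to run-local.sh instead of exporting RUNTIME_CLASS (a sketch; the step name is illustrative):

```yaml
      - name: Run e2e tests
        run: |
          export PATH="$PATH:/usr/local/bin"
          ./run-local.sh -r "${{ matrix.runtimeclass }}" -u
```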
Force-pushed from 4d5465e to d099345.
@wainersm I addressed all the issues and on top modified the …
@ldoktor it looks great! The tests that require credentials were removed; could you rebase so we run this again? Ah, I introduced one more …
Rebased & treated 2 new occurrences of `$(hostname)` in …
Yep, looks like that. Also, looking at the age of the containers I'm wondering whether the operator is really ready (I mean the pods are ready, but I'm wondering whether the init is completed by the time the uninstall happens, which might slow the removal...). Let me try doubling the deadlines...
@wainersm it passed with 4x the deadline, but after the testing the output shows …
720 seconds to uninstall the operator seems like too much time. The fact that it is sometimes not able to finish within that window of time may indicate a legit bug. I noticed the operator uninstall reached the timeout after the tests executed, i.e., when the workflow tries to revert the system to its pre-testing state.
Well, trying it on my system (a kcli Ubuntu VM on a T14s laptop) it usually takes 4.5m to uninstall and 1m to install. So in an unstable cloud environment the 6m seems legit, and allowing up to double that time in case of an overloaded cloud does not sound all that bad. Perhaps there really isn't a bug (or there is, but it can recover). Let me run a loop to better examine the timing.
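Something along these lines (the install/uninstall commands below are placeholders for the actual e2e steps, not the repository's real entry points):

```bash
# Hypothetical timing loop: repeatedly uninstall and reinstall the operator
# and print how long each phase takes.
for i in $(seq 1 10); do
    start=$(date +%s)
    uninstall_operator          # placeholder for the real uninstall step
    echo "uninstall #$i: $(( $(date +%s) - start ))s"

    start=$(date +%s)
    install_operator            # placeholder for the real install step
    echo "install   #$i: $(( $(date +%s) - start ))s"
done
```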
@wainersm it seems to be stable: uninstall takes 4.5m and reinstall 50-70s. I think the new deadlines are reasonable, and they finish early if the condition is reached. I think it's ready to be merged, what do you think?
I really appreciated the analysis you did! Yes, I think it is ready to be merged.
LGTM. Thanks @ldoktor !
LGTM - Thanks @ldoktor
One last time we can run the clh Jenkins tests. After this is merged I'll disable the project and remove it from required, and after a grace period we should enable the GHA workflow tests as required.
/test
/test
Hey @ldoktor - I tried to update this branch after another PR got merged, but now the tests are failing, so I'm not sure if the auto-merge had issues? It might be worth you doing a rebase and force pushing to remove the extra merge commit, then we can re-try the tests.
use the azure runners provided by "confidential-containers/infra" to run the kata-clh and kata-qemu workflows. Signed-off-by: Lukáš Doktor <[email protected]>
the uninstall timeouts seem to be too low for the azure runners. Signed-off-by: Lukáš Doktor <[email protected]>
Rebased, no changes.
/test
We are getting: … on the uninstall test. I'm not sure if that means we need a longer timeout/sleep, or if there is something else going on that I'm missing from the debug?
I think the timeout is really generous now, so this might be an actual issue. I haven't run into this problem locally; I'll try to dig deeper tomorrow.
Still not reproduced, but I noticed in GH the manager's restart count is 4, while on my machine the restart count is 0. I'll try to stress my machine; perhaps it is related to that.
@wainersm @stevenhorsman what would you say about something like this? On Azure the manager pod (and others) is restarted several times before things stabilize, which is likely causing the issues on operator uninstall.
Especially on azure workers we are seeing several pod restarts right after CoCo deployment, let's wait for 3x21s which should be enough to detect instabilities as the liveness probe is 15+20s. Signed-off-by: Lukáš Doktor <[email protected]>
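Roughly, that stabilization wait amounts to something like the sketch below (the namespace and the status query are assumptions for illustration; the real script may differ and would also enforce an overall deadline):

```bash
# Wait until pod restart counts stay unchanged for three consecutive 21s
# intervals (the liveness probe is 15s + 20s, so 3x21s should surface
# late restarts). The namespace below is an assumption for illustration.
ns="confidential-containers-system"
stable=0
prev=""
while [ "$stable" -lt 3 ]; do
    sleep 21
    now=$(kubectl get pods -n "$ns" -o \
        jsonpath='{range .items[*]}{.metadata.name}={.status.containerStatuses[*].restartCount} {end}')
    if [ "$now" = "$prev" ]; then
        stable=$((stable + 1))
    else
        stable=0
        prev="$now"
    fi
done
echo "Pods look stable: $now"
```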
/test