Self-Hosted Runners on GHA Workflows with Kubernetes Driver Canceled Context #290

Open
3 tasks done
pjohnsonrxb opened this issue Jan 5, 2024 · 5 comments

Comments

@pjohnsonrxb

pjohnsonrxb commented Jan 5, 2024

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

Issue: Self-Hosted Runners on GHA Workflows with Kubernetes Driver

Background

We have configured our GitHub Actions (GHA) workflows to use self-hosted runners. Our typical workflow involves:

  • Installing buildx
  • Building, pushing, and caching with buildx

Problem

We are encountering an issue when using the Kubernetes (k8s) driver for our builds. Our self-hosted runners are deployed on our k8s cluster. We're experiencing a specific error as shown in the screenshot below:

Error Screenshot

Kubernetes Container Logs:

time="2023-11-30T22:28:11Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
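
The single error line above says little about what canceled the context. A minimal sketch of how more verbose daemon logs might be captured, assuming the builder is created by docker/setup-buildx-action in the same job and that the driver in use honors the flag:

```yaml
# Sketch only: turns on BuildKit daemon debug logging for the builder
# created by this step. Setting buildkitd-flags replaces the action's
# default flags, so add any others the build relies on.
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    buildkitd-flags: --debug
```

With the Kubernetes driver, the extra output would be read from the BuildKit pod, the same place the line above was taken from.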

Hypothesis

We suspect that the issue might be related to our runners being behind a VPN. It seems buildx may not be adequately handling network latency associated with a VPN connection.

Observations

  • The issue is isolated to our runners or the k8s driver/buildx combination. This is evident because switching to GitHub's hosted runners resolves the issue, indicating no problems with our workflow or Dockerfile.
  • The failure isn't consistent; approximately 1 in 5 actions encounter this issue. Sometimes the action completes successfully.

References

For additional context, see this related issue.


Seeking insights or suggestions to resolve this intermittent failure with our self-hosted runners in GHA workflows.

Expected Behavior

When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we expect the following:

  1. Stable Connection to Build Services: The runners should maintain a stable connection to Docker's build services, regardless of being behind a VPN. Network latency typically associated with VPN connections should not disrupt the build process.

  2. Consistent Build Process: Each action initiated by the workflow should complete successfully without intermittent failures. The build, push, and cache processes via buildx should be executed reliably.

  3. Error-Free Operation: The buildx command, especially when interacting with Kubernetes, should execute without returning errors like /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled.

  4. Consistency with GitHub Hosted Runners: The performance and reliability of builds using self-hosted runners should be comparable to those observed with GitHub's hosted runners.

The expectation is that the self-hosted runners on our Kubernetes cluster should work as efficiently and reliably as GitHub's hosted runners, ensuring a smooth CI/CD pipeline.

Actual Behavior

When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we are encountering the following issues:

  1. Unstable Connection to Build Services: The runners, especially when operating behind a VPN, are experiencing unstable connections to Docker's build services. This is evident from frequent connection cancellations and errors during the build process.

  2. Inconsistent Build Process: The actions initiated by the workflow are not completing consistently. Approximately 20% of the actions (1 in 5) fail intermittently, showcasing a lack of reliability in the build, push, and cache processes via buildx.

  3. Frequent Errors: We are frequently encountering errors such as /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled. These errors suggest issues with the interaction between buildx and Kubernetes.

  4. Disparity with GitHub Hosted Runners: Unlike the smooth operation observed with GitHub's hosted runners, our self-hosted runners exhibit inconsistent and error-prone behavior, leading to a disrupted CI/CD pipeline.

In summary, our self-hosted runners on the Kubernetes cluster are not performing as efficiently or reliably as expected, particularly in comparison to GitHub's hosted runners.

Repository URL

No response

Workflow run URL

No response

YAML workflow

name: Build and Push Docker Image

on:
  workflow_call:

jobs:
  build-and-push-image:
    runs-on: [gha-runner-scale-set]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Set ECR repository name
        id: set_repo_name
        run: |
          REPO_NAME="${{ github.event.repository.name }}"
          ECR_REPO_NAME="${REPO_NAME//./-}"
          echo "ECR_REPO_NAME=$ECR_REPO_NAME" >> $GITHUB_ENV

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: <ecr>/${{ env.ECR_REPO_NAME }}:${{ github.sha }}
          context: .
          build-args: |
            GITHUB_UN=${{ secrets.GITHUBUSERMAME }}
            GITHUB_PW=${{ secrets.GITHUBPASSWORD }}
          cache-from: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache
          cache-to: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache,mode=max,image-manifest=true
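
The workflow above does not set a driver on the setup step. For readers trying to reproduce the setup, here is a sketch of how the Kubernetes driver could be selected explicitly. The namespace and replica count are placeholder values, not taken from the reporter's cluster:

```yaml
# Sketch only: selects the buildx Kubernetes driver explicitly.
# "buildkit" is a hypothetical namespace; replicas=1 is a placeholder.
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    driver: kubernetes
    driver-opts: |
      namespace=buildkit
      replicas=1
```

This assumes the runner pod has credentials that allow it to create the BuildKit deployment in that namespace.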

Workflow logs

No response

BuildKit logs

No response

Additional info

Also, it is important to note that this job only ever cancels during the build and push step. We use Actions for other things, and those jobs never cancel for no apparent reason.

@acrogenesis

We get the same problem in our ARM64 self-hosted workflows, although our k8s cluster is not behind a VPN.

@elocke

elocke commented Jul 25, 2024

We're seeing the same issues, both with and without buildx, and can't pinpoint an exact cause. We're on AWS behind a VPC, transit gateway, etc., but with no VPN. Platform: amd64.

@0xLE

0xLE commented Jul 27, 2024

Try specifying the builder explicitly:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  id: builder

- name: Build and push
  uses: docker/build-push-action@v6
  with:
    # ...
    builder: ${{ steps.builder.outputs.name }}

@andresrsanchez

Same issue here, solved with a retry step :(
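
The retry step itself is not shown in the comment. One way it could be sketched, assuming a plain second attempt is acceptable and using a hypothetical image tag:

```yaml
# Sketch only: first attempt may fail without failing the job.
- name: Build and push
  id: build
  continue-on-error: true
  uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    # Hypothetical tag, for illustration.
    tags: example/image:latest

# Second attempt runs only if the first one failed.
- name: Build and push (retry)
  if: steps.build.outcome == 'failure'
  uses: docker/build-push-action@v6
  with:
    context: .
    push: true
    tags: example/image:latest
```

Here `steps.build.outcome` is the result of the first attempt before `continue-on-error` is applied, so the retry runs only after a real failure. This works around the intermittent cancellation rather than explaining it.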

@davhdavh

Same problem with a self-hosted Windows build runner sending context to a Linux buildkitd on the same LAN.
