Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable private service connect #476

Closed

Conversation

kon-angelo
Copy link
Contributor

How to categorize this PR?

/area control-plane
/kind enhancement
/platform gcp

What this PR does / why we need it:
Allows user to configure PrivateServiceConnect for their VPC. The user needs to specify an IP for the endpoint and to enable the service.

Important notes:

  • This allows configuration only for accessing GCP services from the endpoint and it is not possible to consume other services from other VPCs (which technically is also a feature under the umbrella of PSC).
  • The endpoint can be any IP but it a private address makes more sense. In order for the shoot to not be broken this endpoint IP must not collide with (gcp ref):
    • Shoot CIDRS (nodes, services, pods)
    • Seed service CIDR (because of the default/kubernetes service routing)
    • Reverse VPN CIDR ranges

ToDos:

  • infrastucture unit tests
  • infrastructure integration tests
  • Validation unit tests
  • runtime config validator check

Gotchas:
The endpoint and fwd-rule created for PSC have very strict requirements for their options. Due to an unknown condition (likely bug in Terraform provider-gcp) the fwd-rule will be deleted and recreated on infrastructure reconciliation. This may cause issues with open connections to services. The reason for that is that the labels field of fwd-rule has to be empty (specifically for PSC rules) which TF interprets as nil. However the GCP API stores it as "empty" and when TF reads it back during reconciliation it marks it as a different value that the nil stored in state. Hence it recreates the resource.

Because of the seed restriction that this feature imposes we need to wary of transitions during shoot scheduling, which is why we introduce the shoots/binding subresource in the admission controller. We what that:

  • For shoots with no spec.seedName on creation we can validate the endpointIP against only the shoot ranges. We also need to prevent scheduler from assigning it to a seed that it may have collisions during update.
  • For shoots created with spec.seedName, check that it can host this endpoint otherwise prevent creation.
  • Our landscapes usually have seeds using the same ranges, so an unlucky shoot may not find compatible seeds. The user should then check shoot events via kubectl to see why the scheduling fails.

We can also think about leveraging configValidator (aka the runtime check before infrastructure reconciliation). However because this is too late in the process, the shoot may already have been scheduled in an incompatible seed.

Introduce private service connect
Which issue(s) this PR fixes:
Fixes #465

Special notes for your reviewer:

Release note:


@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension platform/gcp Google cloud platform/infrastructure needs/review Needs review size/l Size of pull request is large (see gardener-robot robot/bots/size.py) labels Jul 29, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 29, 2022
Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in a private conversation.
At validation time we might not know the Seed (only if the .spec.seedName is set manually in the Shoot resource and not by the gardener-scheduler) which means we cannot reliable validate if cidrs are not clashing.
So I suggest to add an additional check in the configvalidator of the Infrastructure controller to ensure that cidrs are ok. At this point in time the Seed is known and we can early abort (before creating any infrastructure) if the cidrs are clashing.


// PrivateServiceConnect holds the configuration for Private Service Connect endpoints.
type PrivateServiceConnect struct {
Enabled bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc strings.


// PrivateServiceConnect holds the configuration for Private Service Connect endpoints.
type PrivateServiceConnect struct {
Enabled bool `json:"enabled"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc strings.

@@ -145,6 +148,19 @@ func ComputeTerraformerTemplateValues(
"outputKeys": outputKeys,
}

if config.Networks.PrivateServiceConnect != nil && config.Networks.PrivateServiceConnect.Enabled {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the config.Networks.PrivateServiceConnect.Enabled flag at all? Would it not sufficient to enable a private service connect if the endpoint ip is given?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be sufficient but I went for a more explicit enablement. How about this check: if PSC struct it not nil, then both are true:

  • we consider PSC enabled
  • endpointIP must not be empty (enforced via validation).

Comment on lines +147 to +148
// There is an issue currently that TF will always delete and recreate the address during infrastructure reconciliation
// because it cannot handle the labels properly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean for us? Will it always be recreated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is recreated in every infra reconciliation. The reason is that between the terraform and the GCP API the labels value is oscillating between an empty map and a nil value. This makes TF believe that the resource is changed and it recreates it which may interrupt open connections.

I tried all the combinations e.g. specifying the labels as empty map, not specifying them at all and none of them stopped this behavior. Also it is not possible to initialise the map with any kind of value because it is prohibited by the GCP API for endpoints used in PSC

@@ -62,6 +66,9 @@ networks:
# natIPNames:
# - name: manualnat1
# - name: manualnat2
# privateServiceConnect:
# enabled: true
# endpointIP: 10.252.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question will be multiple google services accessible via the same endpoint ip? Or do we need for each service we want to connect privately an own endpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All services should be accessible by the endpoint. For some you may need to create a private DNS zone so that clients can be configured to use them properly

// - Seed's service CIDRs
// - The VPC CIDR (the sub of the subnet CIDRs used by workers)
// - The static ranges used by reverse VPN (192.168.123.0/24)
func ValidatePrivateServiceConnect(fldPath *field.Path, infra *apisgcp.InfrastructureConfig, ranges []cidrvalidation.CIDR) field.ErrorList {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is sufficient to pass the PrivateServiceConnect instead of the entire infra. Also the is enabled check is not necessary as we already checked before if private service connect shall be used.

@gardener-robot gardener-robot added the size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) label Aug 23, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 23, 2022
@gardener-robot gardener-robot removed the size/l Size of pull request is large (see gardener-robot robot/bots/size.py) label Aug 23, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 23, 2022
@kon-angelo
Copy link
Contributor Author

/test

@testmachinery
Copy link

testmachinery bot commented Aug 23, 2022

Testrun: e2e-xwznd
Workflow: e2e-xwznd-wf
Phase: Failed

+---------------------+---------------------+-----------+----------+
|        NAME         |        STEP         |   PHASE   | DURATION |
+---------------------+---------------------+-----------+----------+
| bastion-test        | bastion-test        | Succeeded | 9m9s     |
| infrastructure-test | infrastructure-test | Failed    | 13m5s    |
+---------------------+---------------------+-----------+----------+

@kon-angelo
Copy link
Contributor Author

After discussion we decided to close the PR for now as we are not sure it will have the expected results. We will try to bring #489 first and then consider again about PSC.

@kon-angelo kon-angelo closed this Sep 7, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 7, 2022
@asifhasnaincer
Copy link

Do we have any timeline to completion of the PSC feature? A ticket is already opened for this feature from our side.

@kon-angelo kon-angelo deleted the enh/private-service-connect branch February 29, 2024 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related kind/api-change API change with impact on API users kind/enhancement Enhancement, improvement, extension needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else platform/gcp Google cloud platform/infrastructure size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add configuration for Private Service Connect in GCP InfrastructureConfig defintition
5 participants