
KEP-4962: Standardizing the Representation of Cluster Network Topology #4965

Open · wants to merge 7 commits into base: master
Conversation

dmitsh

@dmitsh dmitsh commented Nov 15, 2024

  • One-line PR description:
    Standardizing Cluster Network Topology Representation
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Nov 15, 2024
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 15, 2024
@k8s-ci-robot
Contributor

Welcome @dmitsh!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 15, 2024
@k8s-ci-robot
Contributor

Hi @dmitsh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 15, 2024
@dmitsh
Author

dmitsh commented Nov 15, 2024

/cc @aojea

@dmitsh
Author

dmitsh commented Nov 15, 2024

@k8s-ci-robot
Contributor

@dmitsh: GitHub didn't allow me to request PR reviews from the following users: tardieu, arsenetar, brianhammons.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @brianhammons @mwielgus @tardieu @mickeyboxell @arsenetar



### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
- `<QoS>`: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
Member

So this object contains N items (of the structure below), where N is the number of predefined topology units (accelerator, block, datacenter, zone), right?

What I want to ensure is that these don't need to be changed/updated when the cluster grows/shrinks (i.e., that they don't define distances between the nodes themselves). But given that, for a given node, the contents don't depend on other nodes (only on its placement in the physical network), this seems to be fine.

Author

@dmitsh dmitsh Nov 18, 2024

@wojtek-t , that's correct, each node will contain QoS metrics between the node and every reachable switch.

@aojea
Member

aojea commented Nov 18, 2024

/assign @johnbelamaric

for sig-architecture

- `<switch-name>`: Unique identifier for the switch

### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
Member

@aojea aojea Nov 18, 2024

This is not really the switch, right? It is the interface on the node that connects to the switch ... and we already have properties to define attributes on network interfaces with DRA (see slide 14: https://docs.google.com/presentation/d/1Vdr7BhbYXeWjwmLjGmqnUkvJr_eOUdU0x-JxfXWxUT8/edit#slide=id.g2f750386db2_5_0), so I don't feel we need this additional annotation here.

cc: @johnbelamaric @thockin

Author

It is not a NIC on the node. These are QoS metrics from the node to every reachable switch. Also, the "switch" in this context could be a physical network device, or an aggregated entity defined by a CSP. For example, AWS returns 3 levels of switches per node, but the actual number of physical switches is unknown.

Member

@aojea aojea Nov 19, 2024

These are QoS metrics from the node to every reachable switch

how is the Node connected to the first switch :) ?

Comment on lines 270 to 286
network.qos.kubernetes.io/switches: {
"nvl10": {
"latency": "2us",
"bandwidth": "100Gbps"
},
"sw11": {
"latency": "50us",
"bandwidth": "40Gbps"
},
"sw21": {
"latency": "500us",
"bandwidth": "20Gbps"
},
"sw31": {
"latency": "1ms",
"bandwidth": "10Gbps"
}
}
Member

These are network interfaces on the node, I think we should better model them with DRA https://github.com/kubernetes/enhancements/pull/4965/files#r1846865095 , that also allows us to provide dynamic capabilities to the interfaces

Author

Again, as mentioned in the earlier comment, these QoS numbers represent node-to-reachable-switch metrics. They are not per NIC.

Member

Does it mean that some switches may not be connected directly?
How then are the latency and bandwidth obtained, to guarantee those values?

Comment on lines 188 to 190
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`
- `<nw-switch-type>`: Logical type of the network switch (can be one of the reserved names or a custom name)
- Reserved names: `accelerator`, `block`, `datacenter`, `zone`
Member

this is the part where we need to loop in SIG Architecture; I briefly touched on this with @thockin and it seems it took some time to settle on region/zone.

So my understanding is that we need to model a hierarchy. This KEP suggests using nested structures: zone > datacenter > block > accelerator, but we should at least describe in the alternatives why weights are not better than this. It seems to me that weights are easier to standardize across different topologies, as you only focus on the distance between layers, and they can support different architectures with multiple layers.

Author

This KEP proposes using reserved network types for typical network architectures, while allowing the network topology to be extended with custom network types.
We provide means for a weighted approach by specifying distance and/or bandwidth, latency, or other metrics.
These are actual measurable physical characteristics of the network, and will be more accurate than specifying static weights.
Once again, we are providing QoS between a node and a switch, so the distance is the number of hops between the node and the switch. The same goes for bandwidth/latency.
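As an illustrative sketch (not part of the KEP), here is how a consumer of these annotations might derive a node-to-node distance from the per-node hop counts; the helper name and sample data are hypothetical:

```python
import json

# Hypothetical annotation values for two nodes, following the KEP's format:
# each key is a reachable switch; "distance" is the hop count from the node
# to that switch.
node_a = json.loads('{"sw11": {"distance": 1}, "sw21": {"distance": 2}}')
node_b = json.loads('{"sw12": {"distance": 1}, "sw21": {"distance": 2}}')

def node_distance(a, b):
    """Hop count between two nodes via their nearest common switch,
    or None if they share no annotated switch."""
    common = a.keys() & b.keys()
    if not common:
        return None
    return min(a[s]["distance"] + b[s]["distance"] for s in common)

print(node_distance(node_a, node_b))  # nearest common switch is sw21: 2 + 2 = 4
```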

Comment on lines 327 to 329
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.

For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions.
Member

IIUIC, the use of custom network types means that environments that use them may not be compatible with other environments, which practically removes all the benefits of standardization; another thing where I think a weighted/distance model can help better.

Author

this was already addressed.


The same network topology depicted in Example 2 can be represented using custom network types.

Let's use `tor` for top-of-rack switches, `area` for the second level of switches, and `center` for the third level.
Member

@aojea aojea Nov 18, 2024

After seeing this example I'm definitely not in favor of custom types, as it is impossible for a generic tool to infer the distance between these custom types ... it will also create fragmentation and incompatibility, as multiple tools can define the same name with different meanings ... the example also talks about levels, which reinforces my idea of weights, so something like network.topology.kubernetes.io/tier: 1

In this example, network.topology.kubernetes.io/tor: sw13 would be:

  • Node: network.topology.kubernetes.io/tier: 1
  • ResourceSlice:
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
…
spec:
    devices:
    - basic:
        attributes:
          tier:
            int: 1
          type:
            string: nvlink
          ip:
            string: 10.1.1.1/24
          latency:
            string: "50us"
          bandwidth:
            string: "100gbps"

Author

When using custom network types, it is mandatory to define distance in the QoS annotation.
We explicitly expressed that in the KEP.

- Gang-scheduling auto-scaler
- DRA scheduler plugin

### Goals
Member

Are there existing topology formats or consumers that we want Kubernetes to integrate with? If so, are these integrations goals or non-goals?

Author

To the best of my knowledge, only EKS exposes custom node labels for network layers.
The goal of this KEP is to create a standard way of describing switch network topology.
Ultimately, we want to use this standard in the development of a Kubernetes-native network-aware scheduler plugin for multi-node workloads. We list that task as a "non-goal", as it would be an independent effort.

Contributor

Ultimately, we want to use this standard in development of Kubernetes-native network-aware scheduler plugin for multi-node workloads.

I'm (still) not seeing why that development work requires a KEP. An out-of-tree, out-of-project custom scheduler could use labels with keys outside of Kubernetes namespaces.

@aojea
Member

aojea commented Nov 20, 2024

I don't see the case for the annotations and labels and conventions be part of Kubernetes and maintained as part of Kubernetes.

Let me write my thoughts on this:

I see value in a predefined set of labels that describe topology, similar to what we have with zone and region; we can have a number of levels that represent proximity for network workloads. The KEP introduces:

1. accelerator: Network interconnect for direct accelerator communication (e.g., multi-node NVLink interconnect between NVIDIA GPUs)
2. block: Rack-level switches connecting hosts in one or more racks as a block.
3. datacenter: Spine-level switches connecting multiple blocks inside a datacenter.
4. zone: Zonal switches connecting multiple datacenters inside an availability zone.

The names could differ, but having a number of predefined levels standard across environments will benefit the projects that implement scheduling based on network topology; I think 4 levels should be OK for most environments.

And now the friction points:

  1. Custom types or an arbitrary number of levels: we all know it is hard to represent all topologies, but if we leave the freedom to add custom types then this solution no longer works, because it means every deployment will be different, or there can be collisions between environments with the same types but different meanings. It would also allow people to use official Kubernetes labels for a completely different purpose. It is not likely that the proposal of custom types will be accepted; I personally will not be in favor.

  2. The annotation suggestion misses who should add the annotation. It also relies on subjective and inconsistently measured attributes; we need things that can be measured the same way everywhere. The NIC or interface of the node has some physical attributes that influence its potential performance, but the actual latency, bandwidth, and speed depend on a combination of factors, including the network environment and the other devices involved. Determining latency, bandwidth, and speed in a multi-tier network architecture adds another layer of complexity, since it also depends on the hardware, protocols, and interconnections, and most probably has dependencies or shared resources that will vary over time.

@sftim
Contributor

sftim commented Nov 20, 2024

OK, but responding to #4965 (comment): if the labels (and annotations?) were inside networking.foundation.example and not qos.kubernetes.io, topology.kubernetes.io etc: what do we lose?

If we can't articulate that, we should have pause. Maybe some other project should define the standard (not Kubernetes).

Comment on lines +928 to +930
<!--
Why should this KEP _not_ be implemented?
-->
Contributor

How will we describe multihomed nodes?

For example, a node directly connects to two neighbours, and to two top of rack switches. Pod to pod communication can use either best-effort routing or computed paths with source route metadata added to packets on egress.

Kubernetes doesn't say you can't do that, but this standard implies a hierarchical network layout. Are we tacitly backing a particular network management paradigm?

Contributor

Suggested change
<!--
Why should this KEP _not_ be implemented?
-->
This proposal assumes that the network topology is hierarchical.

Author

This is not accurate.
You can easily describe a node connected to multiple switches.

A simple example, where we assume that the performance of the switches is comparable:
Labels:

network.topology.kubernetes.io/level_a: sw01
network.topology.kubernetes.io/level_b: sw02

Annotations:

network.qos.kubernetes.io/switches: {
   "sw01": {
      "distance": 1
   },
   "sw02": {
      "distance": 1
   }
}

An example, where we provide performance metrics:
Labels:

network.topology.kubernetes.io/level_a: sw01
network.topology.kubernetes.io/level_b: sw02

Annotations:

network.qos.kubernetes.io/switches: {
   "sw01": {
      "distance": 1,
      "bandwidth": "40Gbps"
   },
   "sw02": {
      "distance": 1,
      "bandwidth": "60Gbps"
   }
}
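A minimal sketch of how a tool might consume such an annotation to pick the preferred uplink on a multihomed node; the parsing helper and unit table are assumptions, not part of the KEP:

```python
import json
import re

# The multihomed-node annotation from the example above.
switches = json.loads("""
{
  "sw01": {"distance": 1, "bandwidth": "40Gbps"},
  "sw02": {"distance": 1, "bandwidth": "60Gbps"}
}
""")

_UNITS = {"Kbps": 1e3, "Mbps": 1e6, "Gbps": 1e9}

def parse_bandwidth(value):
    """Convert a string like '60Gbps' to bits per second."""
    m = re.fullmatch(r"([0-9.]+)\s*(Kbps|Mbps|Gbps)", value)
    if m is None:
        raise ValueError(f"unrecognized bandwidth: {value!r}")
    return float(m.group(1)) * _UNITS[m.group(2)]

def preferred_uplink(switches):
    """Among the annotated switches, prefer the smallest distance,
    breaking ties by the highest bandwidth."""
    return min(
        switches,
        key=lambda s: (switches[s]["distance"],
                       -parse_bandwidth(switches[s]["bandwidth"])),
    )

print(preferred_uplink(switches))  # both at distance 1, sw02 wins on bandwidth
```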

@aojea
Member

aojea commented Nov 20, 2024

OK, but responding to #4965 (comment): if the labels (and annotations?) were inside networking.foundation.example and not qos.kubernetes.io, topology.kubernetes.io etc: what do we lose?

I do see value in topology.kubernetes.io/zone and topology.kubernetes.io/region. They made a difference in deploying highly available applications and consolidated environments, tooling, and projects across the ecosystem.

So I think that aiming for something similar for the new types of workloads that require a more granular definition of network topology makes sense ... I really think we should first work out what the new region and zone are for these types of workloads, and adding four new well-known labels makes sense to me. I think I can try to get consensus from the Google side on 4 labels (@wojtek-t WDYT), but we'll need people from other cloud providers to agree (@dims, @ritazh, ...) ... this only works if we all agree.

@sftim
Contributor

sftim commented Nov 20, 2024

Following up #4965 (comment)

I still have concerns; I'll explain more.

We can encourage cloud providers to define their own labels as well.
For example, on AWS a node can be labelled with topology.kubernetes.io/zone and with topology.k8s.aws/zone-id; labelling like that is of course additive.

I agree about the value of topology.kubernetes.io/zone and topology.kubernetes.io/region; if we didn't have those, they would be roughly 100% worth defining. With those existing labels, there is also much less risk of wanting to assert that a node is in two of something. A node might connect to two switches, but we very, very rarely put nodes in two regions. (If it's in two regions, why did you put your datacenter in Baarle-Hertog or wherever 😉?)

So this KEP wouldn't persuade providers and / or hardware vendors to define their own parallel, provider-specific labels - but encouraging that kind of guidance would bring something on top of a common denominator of labels that Kubernetes does define.

I remain worried about tacitly promoting one network paradigm (a switching hierarchy), especially for a project that might be around another 10 years. This KEP should articulate its alternatives and make clear why we picked the approach we select.

@aojea
Member

aojea commented Nov 20, 2024

I remain worried about tacitly promoting one network paradigm (a switching hierarchy),

agree, but I see this as being about network hierarchy; a switching hierarchy is just one implementation of a network hierarchy

Level 1: there are things that are closest; this can be a switch, a rack, or the PCI bus; the implementer decides this
Level 2: there are things that have to go through hops to get to level 1; this connects multiple level 1s; the number of hops is abstracted, you just know you are not at the same level as the things in level 1; the implementer decides this
Level 3: same as level 2 is to level 1; the implementer decides; it connects level 2s
Level 4: ...
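A rough sketch of how the level model above could be consumed; the label keys and helper are hypothetical, not something defined in the KEP:

```python
# Hypothetical label keys for the level/tier model sketched above:
# one label per level, naming the domain the node belongs to at that level.
PREFIX = "network.topology.kubernetes.io/"

node_a = {
    PREFIX + "level1": "rack-1",
    PREFIX + "level2": "spine-1",
    PREFIX + "level3": "zone-a",
}
node_b = {
    PREFIX + "level1": "rack-2",
    PREFIX + "level2": "spine-1",
    PREFIX + "level3": "zone-a",
}

def proximity(a, b, levels=4):
    """Lowest level at which two nodes share a domain (lower is closer),
    or None if they share nothing up to `levels`."""
    for lvl in range(1, levels + 1):
        key = f"{PREFIX}level{lvl}"
        if key in a and a.get(key) == b.get(key):
            return lvl
    return None

print(proximity(node_a, node_b))  # first shared domain is spine-1, at level 2
```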

@sftim
Contributor

sftim commented Nov 20, 2024

I guess as an actual alternative, we could define an API for representing network topology.

For example, custom resources:

  • NetworkTopologyDomain (eg: a row of racks in a datacenter; eg an abstract node placement group)
  • InterconnectionClass (eg Ethernet, 1000Mbs, best effort)
  • PacketForwarder (switches and routers)

Maybe also:

  • OverlayLink (eg internal VPNs)
  • LinkBindingClaim ("this Pod should / does use that interface")
  • IPAddressRangeAssignment (obvious I hope)
  • PathReservation (maybe "please allow this Pod to send 2Gbps to this destination")
  • NetworkIsolationGroup (for abstractions like hardware-accelerated packet filtering, or for exemptions from an otherwise zero-trust architecture)
  • either or both:
    • InterfacePair (eg WAN links between zones)
    • InterfaceLinkSet (the above, plus shared-media examples like wireless Ethernet)

Remember, alternatives in KEP feedback don't have to be the choice we select. They can instead be viable options we rule out but still list as alternatives we considered.

@bowei
Member

bowei commented Nov 21, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2024
@dmitsh
Author

dmitsh commented Nov 24, 2024

I remain worried about tacitly promoting one network paradigm (a switching hierarchy), especially for a project that might be around another 10 years. This KEP should articulate its alternatives and make clear why we picked the approach we select.

I want to stress that we are not advocating for or favoring any specific network hierarchy. On the contrary, our goal is to make the system flexible and adaptable to any type of network.

To achieve this, we propose using reserved types for common configurations, enabling simplicity and brevity. Additionally, we recommend supporting custom types as a more generic solution, where key details like the number of hops (as a minimum input) and additional performance metrics can be specified.

This information will be crucial for optimally scheduling multi-node training jobs or sets of interdependent, data-intensive services. It ensures that network-aware decisions can be made to enhance performance and efficiency.
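To illustrate (under the assumption of the four reserved types from the KEP), a scheduler-side sketch that ranks candidate nodes by the closest reserved level they share with a seed node; the node data and helper are hypothetical:

```python
# The four reserved network types from the KEP, ordered closest to farthest.
RESERVED = ["accelerator", "block", "datacenter", "zone"]
PREFIX = "network.topology.kubernetes.io/"

def closeness(seed, candidate):
    """Index of the closest reserved level shared with the seed node;
    len(RESERVED) if no level is shared."""
    for i, level in enumerate(RESERVED):
        key = PREFIX + level
        if key in seed and seed.get(key) == candidate.get(key):
            return i
    return len(RESERVED)

# Hypothetical node labels for a seed pod's node and two candidates.
seed = {PREFIX + "accelerator": "nvl10", PREFIX + "block": "sw11", PREFIX + "zone": "sw31"}
near = {PREFIX + "accelerator": "nvl10", PREFIX + "block": "sw11", PREFIX + "zone": "sw31"}
far = {PREFIX + "accelerator": "nvl20", PREFIX + "block": "sw12", PREFIX + "zone": "sw31"}

ranked = sorted([far, near], key=lambda n: closeness(seed, n))
print(ranked[0] is near)  # True: the accelerator-local node ranks first
```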

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 24, 2024
@aojea aojea changed the title KEP-4962: Standardizing the Representation of Cluster Switch Network Topology KEP-4962: Standardizing the Representation of Cluster Network Topology Nov 26, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 26, 2024
@@ -154,13 +154,20 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

This document proposes a standard for declaring switch network topology in Kubernetes clusters, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or a collection of such devices with close proximity and functionality.
This document proposes a standard for declaring switch network topology in Kubernetes clusters,
Member

@aojea aojea Nov 26, 2024

I would remove the reference to switch network topology, as it has connotations. I think we are talking here about "Hierarchical Network Architectures": we have a hierarchy of "networks" with different levels, and the components at the same level should expect the same network properties in terms of performance.

- `latency`: Link latency (e.g., 200 ms), optional
- `bandwidth`: Link bandwidth (e.g., 100 Gbps), optional
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`, where
- `<nw-switch-type>` defines the characteristic of the network switch. The term `switch` can refer to a physical network device or a collection of such devices with close proximity and functionality.
Member

@aojea aojea Nov 26, 2024

hierarchy-layer-type?

@dmitsh dmitsh force-pushed the ds-topology branch 5 times, most recently from 8d2ba06 to 6253878 Compare November 26, 2024 23:05

## Summary

This document proposes a standard for declaring switch network topology in Kubernetes clusters, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or a collection of such devices with close proximity and functionality.
Member

Kubernetes has generally shied away from describing physical attributes vs user intent.

One of the challenges that this proposal needs to address is how this can be future-proofed against changes in underlying technology.


Beyond CSPs, certain on-premises clusters support network topology discovery, though this capability depends on the features of the underlying switch network vendors.

An open-source project, [Topograph](https://github.com/NVIDIA/topograph), has implemented these approaches and is successfully deployed in production environments.
Member

It would be good to discuss what this proposal does differently from Topograph:

  • Is it porting over the general concepts? What worked/did not?
  • Does it address the problems with the topograph project?

Signed-off-by: Dmitry Shmulevich <[email protected]>
@k8s-ci-robot
Contributor

@dmitsh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-enhancements-test 623396b link true /test pull-enhancements-test
pull-enhancements-verify 623396b link true /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Having these labels available in Kubernetes clusters will help in designing cloud agnostic scheduling systems.
The scheduler will prioritize switches according to the order outlined above, providing a standardized approach for network-aware scheduling across a range of configurations.

### User Stories (Optional)
Contributor

Suggested change
### User Stories (Optional)
### User Stories

Comment on lines +273 to +274
The scheduler plugin reconstructs the cluster network topology by interpreting `network.topology.kubernetes.io/...` node labels.
Using this topology information, it optimally binds pods to suitable nodes, reducing overall latency and improving performance.
Contributor

This is the solution more than part of the story.

9 participants