With TopologySpreadConstraints, Kubernetes has a tool to spread your pods across different topology domains.
Wait, topology domains? What are those? I hear you, as I had the exact same question.
A topology is simply a label name or key on a node.
A domain then is a distinct value of that label.
If, for example, we have these 3 nodes:
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    topology.kubernetes.io/zone: eu-west-1a
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
  labels:
    topology.kubernetes.io/zone: eu-west-1b
---
apiVersion: v1
kind: Node
metadata:
  name: node-3
  labels:
    topology.kubernetes.io/zone: eu-west-1b
We have the ability to reference the topology topology.kubernetes.io/zone. For that topologyKey there are 2 domains: eu-west-1a and eu-west-1b.
Now that we know this, how can we use that?
Using TopologySpreadConstraints
As stated before, we can use TopologySpreadConstraints to spread pods across different nodes.
This way we can, for example, create a high-availability setup (I have a post about that: High availability in Kubernetes). We can also use it to ensure each domain runs a certain pod, which might be useful for more performance-oriented applications.
How does it do that?
The TopologySpreadConstraints are part of the pod spec and are used during scheduling of the pod.
During scheduling, the scheduler looks at the available nodes and checks which domains exist for the given topology. In the previous example, for the topologyKey topology.kubernetes.io/zone, those were eu-west-1a and eu-west-1b.
It then checks how many pods (matching the given label selector) are already running in each domain and compares the difference against the maxSkew, or in other words, how big the difference between the domains is allowed to be.
If we have maxSkew=1, and eu-west-1a has 2 matching pods while eu-west-1b has 1 matching pod, then Kubernetes will try to schedule the pod on node-2 or node-3, as they are part of eu-west-1b. Node taints and affinity rules can also be taken into account, via the nodeTaintsPolicy and nodeAffinityPolicy fields. More about that in the spec below.
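To make this concrete, a pod spec for the zone example above could look roughly like this. This is a minimal sketch: the app: app-name label, the pod name, and the image are illustrative placeholders rather than values from a real cluster.

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: app-name
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: app-name
  containers:
    - name: pod-across-zones
      image: example/app:latest

With whenUnsatisfiable: DoNotSchedule the pod would stay Pending if placing it anywhere pushes the skew above 1; ScheduleAnyway would merely prefer the less crowded zone.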
Combining TopologySpreadConstraints
So, we’ve seen how to use TopologySpreadConstraints. Can we combine multiple TopologySpreadConstraints?
The answer is: yes! Yes, we can combine them.
When combining TopologySpreadConstraints, they act as an AND rule: a pod is only scheduled on a node that satisfies all constraints.
However, each constraint's whenUnsatisfiable setting is still taken into account.
Now take the following example:
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    nodegroup: group-1
    topology.kubernetes.io/region: eu-west-1
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
  labels:
    nodegroup: group-2
    topology.kubernetes.io/region: eu-west-1
---
apiVersion: v1
kind: Node
metadata:
  name: node-3
  labels:
    nodegroup: group-1
    topology.kubernetes.io/region: eu-east-1
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: app-name
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      labelSelector:
        matchLabels:
          app: app-name
      topologyKey: nodegroup
      whenUnsatisfiable: ScheduleAnyway
    - maxSkew: 1
      labelSelector:
        matchLabels:
          app: app-name
      topologyKey: topology.kubernetes.io/region
      whenUnsatisfiable: DoNotSchedule
  containers:
    - name: pod-across-zones
      image: example/app:latest
There are 2 topologyKeys we are looking at: nodegroup and topology.kubernetes.io/region.
With this config we make sure that the pods are evenly spread among the different regions, and we try to spread the pods evenly among the nodegroups, but that second part is not a hard requirement.
That means that after we've scheduled 5 pods (1 on node-1, 2 on node-2, 2 on node-3), the 6th pod has to be scheduled on node-3 again; otherwise the topology.kubernetes.io/region constraint cannot be met (eu-west-1 would then have 4 matching pods against eu-east-1's 2, exceeding the maxSkew of 1).
TopologySpreadConstraints Spec
Let’s check out the spec of the TopologySpreadConstraints (YAML copied from the Kubernetes docs):
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
    - maxSkew: <integer>
      minDomains: <integer> # optional; beta since v1.25
      topologyKey: <string>
      whenUnsatisfiable: <string>
      labelSelector: <object>
      matchLabelKeys: <list> # optional; beta since v1.27
      nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
      nodeTaintsPolicy: [Honor|Ignore] # optional; beta since v1.26
  ### other Pod fields go here
- maxSkew
  maxSkew determines the permitted difference in the number of matching pods between topology domains. It is an integer, counted in pods. So if you want the difference between domains to be at most 1 pod, you set it to 1.
  When whenUnsatisfiable: DoNotSchedule is used, pods will not be scheduled when the difference would become too large. The skew is measured against the global minimum: the minimum number of matching pods in an eligible domain, or zero if the number of eligible domains is less than minDomains.
  When whenUnsatisfiable: ScheduleAnyway is used, the scheduler gives higher precedence to topology domains that have fewer pods (reducing the skew).
- minDomains
  Used in combination with maxSkew and whenUnsatisfiable=DoNotSchedule.
  It is a beta feature that is turned on by default (since 1.27).
  You can set it to a positive integer: the minimum number of eligible topology domains. As long as fewer domains exist, the global minimum used in the maxSkew calculation is treated as zero.
- topologyKey
  This is the key used to select the topology. These are node labels, for example topology.kubernetes.io/zone. For a list of well-known labels check here.
- whenUnsatisfiable
  DoNotSchedule prevents scheduling when the topology skew cannot stay within maxSkew. ScheduleAnyway turns maxSkew into a recommendation.
- labelSelector
  Used to find the pods that count towards a topology domain. The value of this field determines which pods the skew is calculated for.
- matchLabelKeys
  A list of label keys to refine the labelSelector even further. For example, you can have a labelSelector on app and a label key pod-template-hash. This makes sure that different revisions of the app are spread independently (see the sketch after this list).
- nodeAffinityPolicy
  During scheduling, the skew is calculated by analyzing the available nodes. When we have 6 nodes spread over 3 different zones, we check where our pods are already deployed.
  With this flag set to Honor, we only include nodes that match the pod's nodeAffinity selector. So when 2 out of 5 nodes match the affinity selector, only the pods on those 2 nodes are included in the spread calculation. This works the same way with the nodeSelector.
  When it is set to Ignore, all nodes are taken into consideration when calculating the spread.
- nodeTaintsPolicy
  How to treat tainted nodes during the spread calculation. There are 2 settings: Honor excludes tainted nodes unless the pod has a toleration for the taint. Ignore ignores any taints on the nodes and includes all nodes that satisfy the constraints configured earlier.
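To make these fields more concrete, here is a rough sketch of how they could be combined in a Deployment's pod template. The app: app-name label and the image mirror the earlier examples; the concrete values for replicas, minDomains, and the two policies are illustrative assumptions, not recommendations.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: app-name
  template:
    metadata:
      labels:
        app: app-name
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          minDomains: 3 # treat the global minimum as 0 until 3 zones are eligible
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: app-name
          matchLabelKeys:
            - pod-template-hash # spread each revision of the Deployment independently
          nodeAffinityPolicy: Honor # only count nodes matching the pod's nodeAffinity/nodeSelector
          nodeTaintsPolicy: Honor # exclude tainted nodes the pod does not tolerate
      containers:
        - name: app
          image: example/app:latest

With this sketch, every revision of the Deployment is spread across at least 3 zones with at most 1 pod difference between zones, while tainted or non-matching nodes are left out of the calculation.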
Known limitations
- There’s no guarantee that the constraints remain satisfied when Pods are removed.
  For example, take a Deployment with 10 pods spread over 2 topology domains with a maxSkew of 1. When scaling down to 3 pods, it can happen that all 3 pods end up in a single topology domain.
  This can be mitigated by using a tool like Descheduler to rebalance the pod distribution (a policy sketch follows after this list).
- When calculating which topology domains are available, the scheduler only knows about existing nodes. This can lead to problems on autoscaled clusters, where only a minimum number of nodes is running, so not all possible topology domains are available.
  There are autoscalers that do have prior knowledge of (certain) topology domains, like Karpenter.
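For the first limitation, the Descheduler's RemovePodsViolatingTopologySpreadConstraint strategy can evict pods whose constraints are no longer satisfied so the scheduler can place them again. Below is a sketch using the v1alpha1 policy format; check the Descheduler documentation for the format matching the version you run.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false # only rebalance hard (DoNotSchedule) constraints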
Links
Kubernetes docs
Descheduler
karpenter
Well-Known Labels, Annotations and Taints
High availability in Kubernetes