With TopologySpreadConstraints, Kubernetes gives you a tool to spread your pods across different topology domains.
Wait, topology domains? What are those? I hear you, as I had the exact same question.
A topology is simply a label key on a node.
A domain is then a distinct value of that label.
If, for example, we have 3 nodes where node-1 and node-2 carry the label topology.kubernetes.io/zone=eu-west-1a and node-3 carries topology.kubernetes.io/zone=eu-west-1b, we can reference the topology via that label key. For that topologyKey there are 2 domains: eu-west-1a and eu-west-1b.
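As a sketch, those node labels might look like this (the node names and zone values are taken from the example later in this post):

```yaml
# The label key topology.kubernetes.io/zone is the topology;
# its distinct values (eu-west-1a, eu-west-1b) are the domains.
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    topology.kubernetes.io/zone: eu-west-1a
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
  labels:
    topology.kubernetes.io/zone: eu-west-1a
---
apiVersion: v1
kind: Node
metadata:
  name: node-3
  labels:
    topology.kubernetes.io/zone: eu-west-1b
```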
Now that we know this, how can we use that?
As stated before, we can use TopologySpreadConstraints to spread pods around different nodes.
This way we can, for example, create a high-availability setup (I have a post about that: High availability in Kubernetes). We can also use it to ensure each domain runs a certain pod, which might be useful in more performance-oriented applications.
How does it do that?
The TopologySpreadConstraints are part of the pod spec and are used during scheduling of the pod.
During scheduling, the scheduler looks at the available nodes and checks which domains exist for the given topology.
In the previous example, for the topologyKey topology.kubernetes.io/zone, those were eu-west-1a and eu-west-1b.
It will check how many pods (matching the given label selector) are already running in each domain.
Then it checks the maxSkew, or in other words, how big the difference between the domains is allowed to be.
Say we have maxSkew=1, eu-west-1a has 2 matching pods, and eu-west-1b has 1 matching pod. Then Kubernetes will try to schedule the pod on node-3, as it is part of eu-west-1b. By default it also takes node taints and affinity policies into account. More about that in the spec below.
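A minimal sketch of such a constraint in a pod spec (the app: my-app label and nginx image are assumed placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # allowed pod-count difference between domains
      topologyKey: topology.kubernetes.io/zone  # spread over the zone domains
      whenUnsatisfiable: DoNotSchedule          # refuse to schedule if maxSkew would be exceeded
      labelSelector:
        matchLabels:
          app: my-app                           # which pods count towards the skew
  containers:
    - name: app
      image: nginx
```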
So, we’ve seen how to use the TopologySpreadConstraints. Now, can we combine multiple TopologySpreadConstraints?
The answer is: Yes! Yes, we can combine them.
When combining TopologySpreadConstraints, they act with an AND rule.
However, each constraint’s whenUnsatisfiable setting is still taken into account.
Now take the following example:
There are 2 topologyKeys we are looking at: the region and the nodegroup.
With this config we make sure that the pods are evenly spread across the different regions, and we try to evenly spread pods across nodegroups. The latter, however, is not a hard requirement.
That means that after we’ve scheduled 5 pods (1 on node-1, 2 on node-2, 2 on node-3), the 6th pod should be scheduled on node-3 again, as otherwise the topology.kubernetes.io/region constraint cannot be met.
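The original config for this example isn’t shown here, but a sketch of two combined constraints could look like this (the nodegroup label key eks.amazonaws.com/nodegroup and the app label are assumptions):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/region  # hard requirement: spread across regions
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: eks.amazonaws.com/nodegroup    # soft preference: spread across nodegroups
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
```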
Let’s check out the spec of the TopologySpreadConstraints (YAML copied from the Kubernetes docs):
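The field skeleton looks roughly like this (reconstructed from the Kubernetes docs, as the original snippet was not preserved here):

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: <integer>
      minDomains: <integer>                 # optional
      topologyKey: <string>
      whenUnsatisfiable: <string>           # DoNotSchedule | ScheduleAnyway
      labelSelector: <object>
      matchLabelKeys: <list>                # optional
      nodeAffinityPolicy: [Honor|Ignore]    # optional
      nodeTaintsPolicy: [Honor|Ignore]      # optional
```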
maxSkew is used to determine the permitted difference in pod counts between the topology domains.
It is an integer, measured in pods.
So when you want the pod counts of the domains to differ by at most 1, you set it to 1.
When whenUnsatisfiable: DoNotSchedule is used, the pods will not be scheduled when the difference would become too large.
It is also used as a minimum number of pods per topology domain! This is only the case when the number of matching domains is less than the value of minDomains; otherwise the minimum number of matching pods is zero.
When whenUnsatisfiable: ScheduleAnyway is used, the scheduler gives higher precedence to topology domains that have fewer pods (reducing the skew).
Used in combination with whenUnsatisfiable: DoNotSchedule.
It is a beta feature that is turned on by default (since 1.27).
When enabled, you can set a positive integer. This is the minimum number of topology domains that need to match for the maxSkew to act against the global minimum.
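For instance, a sketch of a constraint that insists on at least 3 zone domains (label values assumed):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 3                     # fewer than 3 matching domains triggers the global minimum of 0
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule  # minDomains is used together with DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
```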
This is the key used to select the topology. These are node labels, for example topology.kubernetes.io/zone. For a list of well-known labels, check the Kubernetes docs.
- DoNotSchedule: prevents scheduling when the topology skew cannot stay within the maxSkew.
- ScheduleAnyway: makes maxSkew a recommendation rather than a hard requirement.
Used to find the pods that count towards a topology domain. The value of this field determines for which pods the skew is calculated.
A list of label keys to refine the labelSelector even more. For example, you can have a labelSelector on app and a matchLabelKeys entry for pod-template-hash. That way, different revisions of the app are spread independently.
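A sketch of that pod-template-hash setup (the app label is an assumed placeholder):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app           # selects all revisions of the app
    matchLabelKeys:
      - pod-template-hash     # each Deployment revision is spread separately
```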
During scheduling, the skew is calculated by analyzing the available nodes.
When we have 6 nodes spread over 3 different zones, we check where our pods are already deployed.
With this flag set to Honor, we only include nodes that adhere to the pod’s nodeAffinity.
So when 2 out of 5 nodes match the affinity selector, only those 2 nodes are included in the spread calculation.
This works the same way with the nodeSelector.
When it is set to Ignore, all nodes are taken into consideration when calculating the spread.
How tainted nodes are treated during the spread calculation.
There are 2 settings:
- Honor: excludes tainted nodes, unless the pod has a toleration for the taint.
- Ignore: ignores any taints on the nodes, including all tainted nodes that otherwise adhere to the constraints configured earlier.
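Both policies are set per constraint; a sketch (labels assumed):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
    nodeAffinityPolicy: Honor  # only count nodes matching the pod's nodeAffinity/nodeSelector
    nodeTaintsPolicy: Honor    # exclude tainted nodes the pod doesn't tolerate
```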
- There’s no guarantee that the constraints remain satisfied when pods are removed.
For example, take a Deployment with 10 pods spread over 2 topology domains with a maxSkew of 1. When scaling down to 3 pods, it can happen that all 3 pods end up in a single topology domain.
This can be mitigated by using a tool like Descheduler to rebalance the pod distribution.
- When calculating which topology domains are available, the scheduler only has knowledge of existing nodes. This can lead to problems on autoscaled clusters, where only a minimum number of nodes may be running, so not all possible topology domains are available.
There are autoscalers that do have prior knowledge of (certain) topology domains, like Karpenter.