Kubernetes Node Failover

What is the Problem?

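In a Kubernetes cluster, the master nodes periodically check the state of the worker nodes. The default monitoring grace period is 40 seconds, after which an unresponsive node is marked NotReady and tainted as shown below. The pods on that node, however, are not evicted and recreated elsewhere for about 5 minutes.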

NotReady instance:
ip-172-31-44-210   NotReady   <none>                 7h31m   v1.20.2
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

How I Fixed the Problem

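I checked the Kubernetes GitHub issues and Google results and found the `pod-eviction-timeout` parameter, but it did not work as expected. What worked was lowering the toleration seconds for the `unreachable` and `not-ready` taints, using the kube-apiserver flags and the pod tolerations below.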

- --enable-admission-plugins=NodeRestriction,DefaultTolerationSeconds
- --default-not-ready-toleration-seconds=40
- --default-unreachable-toleration-seconds=40

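
For context, here is a rough sketch of where these flags might sit on a kubeadm-style cluster, where the API server runs as a static pod. The file path and the surrounding fields are assumptions for illustration and are not taken from the original setup; only the three flags come from this post.

```yaml
# Assumed location on a kubeadm cluster: /etc/kubernetes/manifests/kube-apiserver.yaml
# Fragment only: the image, volumes, and the other existing flags are omitted here.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Keep every flag that is already present, then add or adjust these:
    - --enable-admission-plugins=NodeRestriction,DefaultTolerationSeconds
    # Default tolerationSeconds injected into pods that do not set their own
    - --default-not-ready-toleration-seconds=40
    - --default-unreachable-toleration-seconds=40
```

If the cluster is set up this way, the kubelet should restart the API server when the manifest file changes. The new defaults only affect pods created afterwards, since the tolerations are injected at admission time.
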
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 40
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 40

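
To show where the tolerations block belongs, here is a minimal Deployment sketch. The name `nginx-deployment` is inferred from the pod names in this post; the replica count, the `app: nginx` label, and the `nginx` image are assumptions made only to keep the example complete.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 1            # assumed; the post does not state the replica count
  selector:
    matchLabels:
      app: nginx         # assumed label
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx     # assumed image
      # Tolerations are set per pod, next to `containers` in the pod spec.
      # After 40 seconds on a tainted node the pod is evicted.
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 40
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 40
```

Setting the tolerations in the pod spec overrides the cluster-wide default for that workload only, which can be useful when just a few deployments need fast failover.
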
NAME   READY   STATUS    RESTARTS   AGE        NODE
nginx-deployment-7944668857-2pdnt 1/1 => ip-172-31-46-223

NAME   READY   STATUS    RESTARTS   AGE        NODE
nginx-deployment-7944668857-2pdnt 1/1 => ip-172-31-46-223
nginx-deployment-7944668857-2pdnt 0/1 => TERMINATING
nginx-deployment-7944668857-cwggz 0/1 => PENDING
nginx-deployment-7944668857-cwggz 0/1 => CONTAINER-CREATING
nginx-deployment-7944668857-cwggz 1/1 => RUNNING ip-172-31-41-27

Summary

With the default configuration, Kubernetes becomes aware of a node failure very quickly, but evicting the pods and recreating the missing replicas takes about 5 minutes, because the default toleration for the `not-ready` and `unreachable` taints is 300 seconds. The approach in this blog post lets us shorten that period.
