Kubernetes Advanced Deployment Strategy - Part 1

A customer presented me with a use case concerning an application they run on OpenShift. The application requires at least five instances running at all times to handle its current traffic load. Unfortunately, they couldn't permanently scale this service beyond five pods, yet they still wanted enough capacity for the application to keep handling traffic during upgrades. The problem they ran into is that the application takes a significant amount of time to come back online, and they see a performance impact during that restart time. They wanted to know if there might be a way to start a supplemental pod preemptively, before Kubernetes shuts down the existing pod. In this blog post, I will present several different options for solving this problem.

For those of you who may not be familiar with the internals of Kubernetes, I'd like to provide a little bit of background. In Kubernetes, when a deployment is created, a corresponding replica set is made for that deployment. This replica set monitors the number of pods that are currently running. If that number differs from the desired quantity, the replica set will request that a pod be either created or destroyed. The replica set does not determine which node the pod is placed on; it simply requests that one be created. The Kubernetes scheduler then decides which node the pod will run on.
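
To make that loop concrete, here is a minimal sketch in Python of the comparison a replica set performs. It is a toy model, not Kubernetes code: the `ToyReplicaSet` class, the pod names, and the way a pod is picked for deletion are all assumptions I made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyReplicaSet:
    """Toy stand-in for a replica set: a desired count plus the pods it owns."""
    desired: int
    pods: list = field(default_factory=list)
    _ids: count = field(default_factory=count, repr=False)

    def reconcile(self):
        """Compare the actual pod count to the desired one and return the actions taken."""
        actions = []
        while len(self.pods) < self.desired:
            name = f"pod-{next(self._ids)}"
            self.pods.append(name)  # request a pod; choosing a node is the scheduler's job
            actions.append(("create", name))
        while len(self.pods) > self.desired:
            name = self.pods.pop()  # assumption: the real controller applies its own ranking
            actions.append(("delete", name))
        return actions


rs = ToyReplicaSet(desired=5)
print(rs.reconcile())    # five create actions, pod-0 through pod-4
rs.pods.remove("pod-2")  # simulate a pod lost to a node drain
print(rs.reconcile())    # a single create action for the replacement
```

Notice that the loop only cares about the count. It has no idea that a node is about to be drained, which is exactly why the replacement pod only shows up after the old one is already gone.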

When you need to restart a worker node for maintenance (or for any reason), the draining process terminates the pods on that node as part of the shutdown. When the draining process removes a pod, Kubernetes will inform the pod's corresponding replica set that the current state does not match the desired state. The replica set will then request that a new pod be deployed somewhere else on the cluster. Because our application is slow to start, the service runs with only four pods until that replacement is ready. So what options do we have to alleviate this issue and avoid having fewer than five pods running during maintenance?

One approach that I've seen discussed is to increase the replica count by one and then reduce it once the node has restarted. This change will cause Kubernetes to schedule an additional pod somewhere in the cluster. Once this pod is up and running, you can go ahead and shut down the node, which will kill the running pod on that node. But as you may have guessed, the replica set will then detect that there are only five pods instead of six, and it will attempt to spin up a sixth pod when we only need five. Of course, you can reduce your replica count from six to five, which will cause the replica set to terminate a pod. Hopefully, the pod picked for termination is the pod the scheduler just created. Unfortunately, there is no guarantee which pod the replica set will select for termination.

Furthermore, this approach causes quite a bit of churn in the cluster by scaling up and then quickly scaling back down when you don't need the sixth pod. If there are pods with priorities, this could cause other, unrelated pods to be evicted during the scale-up and scale-down. For these reasons, I'm not too fond of this approach.
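
To put a rough number on that churn, here is a small Python sketch that replays the sequence of events described above and tallies them. The event list and the `replay` helper are my own illustration; the model treats every pod as ready the instant it is created, so it ignores the slow startup that caused the problem in the first place.

```python
from collections import Counter

# The scale-up-then-down approach, as a list of (kind, reason) events.
SCALE_UP_THEN_DOWN = [
    ("create", "replica count raised from 5 to 6"),
    ("terminate", "node shutdown kills the pod on that node"),
    ("create", "replica set restores the count to 6"),
    ("terminate", "replica count lowered from 6 to 5"),
]


def replay(events, start=5):
    """Replay the events and return (created, terminated, lowest, final) pod counts."""
    running = lowest = start
    tally = Counter()
    for kind, _reason in events:
        running += 1 if kind == "create" else -1
        lowest = min(lowest, running)
        tally[kind] += 1
    return tally["create"], tally["terminate"], lowest, running


created, terminated, lowest, final = replay(SCALE_UP_THEN_DOWN)
print(f"created={created} terminated={terminated} lowest={lowest} final={final}")
```

In this simplified model we never drop below five pods, but we pay for it with two pod creations and two terminations to cover a single node restart.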

Another technique would be to create a custom resource definition (CRD) and an underlying controller (aka operator) to manage any custom resources created from it. In this case, the operator would only slightly modify how the standard deployment controller works, which means reimplementing a lot of code that the Kubernetes engineers already maintain. That's a lot of codebase to maintain for something that should be possible right out of the box.

Maybe we could write a custom scheduler. The scheduler is responsible for placing pods on nodes based on various conditions. It might be possible to write a custom scheduler that takes advantage of knowing which nodes are marked as unschedulable and reschedules their pods somewhere else. One problem with doing so is that it breaks the separation of concerns in Kubernetes: a deployment creates replica sets, which create pods, which the scheduler in turn places on nodes. The Kubernetes model relies on these abstractions, and if we deviate from that model, we break Kubernetes operating concepts. And even if we somehow didn't, we would still end up in a mess trying to figure out which Kubernetes component owns which resource. So again, a custom scheduler isn't the right solution.

So, the best solution is relatively simple. Pods managed by a replica set can be removed from the replica set, at which point they are no longer under the replica set's control. The Kubernetes documentation talks about using this feature to isolate a pod for data recovery or debugging purposes, and we're going to use it to solve our use case. Removing a pod from the replica set isolates it but does not cause it to terminate. When we remove the pod from the replica set, Kubernetes notifies the replica set that the current state does not match the desired state, and the replica set will spin up an additional pod. We can allow our orphaned sixth pod to continue to run and process data while the new pod managed by the replica set comes online. Once that pod is online, we can cleanly shut down the orphaned pod. Since the replica set is not responsible for the orphaned pod, it will take no additional action. A sysadmin can use pod isolation to spin up an extra pod during maintenance and handle the workload without losing capacity or performance.
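
Here is a rough Python sketch of that sequence. It relies on the documented behavior that a replica set owns the pods whose labels match its selector, so changing a pod's label takes it out of the set. Everything else is an assumption for illustration: the `app: web` label, the pod names, and the toy `selects` and `owned` helpers are mine, not part of any Kubernetes API.

```python
def selects(selector, labels):
    """Equality-based match: every key/value in the selector must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


DESIRED = 5
SELECTOR = {"app": "web"}  # hypothetical replica set selector

# Five running pods, all matching the selector.
pods = {f"web-{i}": {"app": "web"} for i in range(DESIRED)}


def owned(pods):
    """Names of the pods the toy replica set currently counts as its own."""
    return sorted(name for name, labels in pods.items() if selects(SELECTOR, labels))


# 1. Isolate the pod on the node we are about to drain by changing its label.
pods["web-0"]["app"] = "web-isolated"
assert "web-0" not in owned(pods) and "web-0" in pods  # orphaned, but still running

# 2. The replica set now counts four pods and requests a replacement.
for n in range(DESIRED - len(owned(pods))):
    pods[f"web-replacement-{n}"] = {"app": "web"}
print(len(pods), len(owned(pods)))  # six running, five owned

# 3. Once the replacement is serving traffic, delete the orphan by hand.
del pods["web-0"]
print(len(pods), len(owned(pods)))  # five running, five owned
```

One thing the sketch doesn't model is traffic routing. In a real cluster, the orphaned pod keeps receiving requests only if the Service selects on a label that you leave unchanged.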

In my next blog post, I'll walk through the above scenario with examples of how to make this happen manually. And in a future post, I'll discuss how we could automate this process using an operator. Are there other solutions to this problem that I did not present? Please let me know in the comments below.