Spinnaker green/blue deployment cleanup solution

Lijie Zhou
5 min read · Jan 2, 2021


Earlier this year, we adopted Spinnaker as our deployment tool at Gusto. By enabling blue/green deployment on top of our core Kubernetes infrastructure, we have seen a big improvement in deployment time, which is definitely a big win for us.

However, Spinnaker uses ReplicaSets to manage its blue/green deployments, which brought us a new set of challenges in managing the deployment pipeline. In Kubernetes, a ReplicaSet is a lower-level abstraction than a Deployment, so no controller will clean up a stale ReplicaSet for us if something goes wrong at the ReplicaSet level. What could go wrong? Let’s start with a simple deploy pipeline.

A simple deploy pipeline

Let’s say we have a simple deployment pipeline that includes three services: WebApp, Stream Consumer, and Workers. All services bake their manifests and deploy simultaneously. We then enable traffic on the new version (the candidate), disable traffic on the old version (the incumbent), and delete the old ReplicaSets.

A simple deploy pipeline deploying three services

Before the “Enable (Candidate)” step, the state of the pipeline is the following:

If everything works as expected, the following will happen:

  • The “Enable (Candidate)” step adds a label to the new versions of the ReplicaSets to enable traffic for all the services.
  • The “Disable (Incumbent)” step removes the label that was added by the previous successful deployment.
  • After traffic is disabled, the old ReplicaSets are deleted. Assuming the “Disable (Incumbent)” step succeeded, no service disruption will occur.
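Conceptually, the enable/disable steps just add or remove a selector label so the Service starts or stops routing traffic to the ReplicaSet’s pods. A minimal sketch of the patch body involved (the label key and value here are illustrative assumptions, not Spinnaker’s actual internals):

```python
def traffic_label_patch(enable: bool, key: str = "traffic", value: str = "enabled") -> dict:
    """Build a strategic-merge patch that adds (enable=True) or removes
    (enable=False) the traffic label; setting a label to None deletes it."""
    return {"metadata": {"labels": {key: value if enable else None}}}
```

A body like this could be passed to `AppsV1Api.patch_namespaced_replica_set` in the Kubernetes Python client.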

Multiple ways to go wrong

In reality, we have observed several scenarios that can cause the pipeline to fail at different stages. Here are the states of the deploy pipeline when each failure happens.

  • Scenario 1: Deploy fails at the deploy stage
  • Scenario 2: Deploy fails at the enable candidate stage
  • Scenario 3: Deploy fails at the disable/delete incumbent stage

Scenario 1

Occasionally, a bad deployment “bypasses” the basic Kubernetes health check. In this case, the bad deploy gets stuck at the “deploy” stage until it exceeds the deployment deadline. Traffic stays on the old versions because traffic to the new version has not been enabled yet.

(I use “T” to mark the ReplicaSets that currently have traffic)

Scenario 2

This is a very rare case, but we did see it happen once or twice. For whatever reason, when Spinnaker tried to put a label on the new ReplicaSets, it failed. This is a messy and dangerous case because both the old and new versions are serving traffic in a failed state.

Scenario 3

There is a slim chance that Spinnaker tries to disable traffic on a pod that Kubernetes is scaling down. If that happens, Spinnaker does not “know” the pod is already gone and keeps trying to remove the label, which will never succeed. In this case, the newer version is already serving traffic, so we can configure the Spinnaker pipeline to ignore the failure and continue deleting the old ReplicaSets.

Because Kubernetes will not clean up these ReplicaSets, we need a solution that handles each of the failure states. Timing also matters: it is best to clean up right after the pipeline fails, to prevent multiple failed versions from stacking up (or, even worse, the old version still serving traffic). The solution we chose: find the most recent version that has traffic, keep it, and delete the rest of the ReplicaSets. This returns the pipeline to a healthy state in each of the scenarios listed above.
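The selection rule above (keep the newest ReplicaSet that has traffic, delete everything else) can be sketched as a pure function; the dict shape and field names here are assumptions for illustration, not our actual script:

```python
def replica_sets_to_delete(replica_sets: list[dict]) -> list[dict]:
    """Given ReplicaSets as dicts with a 'created' timestamp and a
    'has_traffic' flag, return the ones the cleanup job should delete:
    everything except the newest version that is actually serving traffic."""
    with_traffic = [rs for rs in replica_sets if rs["has_traffic"]]
    if not with_traffic:
        # Nothing is serving traffic, so there is no safe version to keep;
        # bail out rather than delete blindly.
        return []
    keep = max(with_traffic, key=lambda rs: rs["created"])
    return [rs for rs in replica_sets if rs is not keep]
```

Note that the conservative empty-traffic branch means the script does nothing when no version carries traffic; that case deserves a human look anyway.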

A pipeline with the cleanup step

We can easily write a Python script to achieve this goal, which solves the first half of the problem. The next question is how to invoke the script and pull the ReplicaSet objects from the Spinnaker pipeline.

How to run the cleanup script safely?

The first option I considered was the Spinnaker “Script stage”, which lets you run an arbitrary shell, Python, or Groovy script on a Jenkins instance as a first-class stage in Spinnaker. However, it would be extra maintenance to configure Jenkins hosts just for this.

The next thought was to use Spinnaker’s “Run Job” stage to run a container. If that works, I just need to Dockerize the Python script and push it to ECR for Spinnaker to pull. Here is an example manifest for the Spinnaker job:

A sample manifest of the Spinnaker job
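A minimal sketch of what such a Run Job manifest can look like; the names, namespace, and image reference are placeholders, not our actual values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: replicaset-cleanup
  namespace: example
spec:
  backoffLimit: 0
  template:
    spec:
      # Service account bound to the Role granting get/list/delete (see below)
      serviceAccountName: replicaset-cleanup
      restartPolicy: Never
      containers:
      - name: cleanup
        # Placeholder ECR image containing the Dockerized cleanup script
        image: <your-ecr-registry>/replicaset-cleanup:latest
        command: ["python", "cleanup.py"]
```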

Additionally, we need to create a service account for the “internal pod” that runs the job (the “internal pod” approach is a very common way to run custom logic within the cluster). It needs “get” and “list” permissions on both Pods and ReplicaSets, plus “delete” on ReplicaSets so the script can remove the stale ones.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: modify-pods
rules:
- apiGroups: [""]       # Pods are in the core API group, not "apps"
  resources:
  - pods
  verbs:
  - get
  - list
- apiGroups: ["apps"]   # ReplicaSets are in the "apps" API group
  resources:
  - replicasets
  verbs:
  - get
  - list
  - delete              # required so the cleanup script can delete stale ReplicaSets

One small but important detail is how to make the Python script talk to the Kubernetes cluster. The simplest way is probably to shell out to kubectl, and there are multiple ways to do that. Alternatively, you can use the official Kubernetes Python client, which is probably more robust. That library is not very well documented, so you may run into some confusion when calling certain APIs. Here is a code snippet to explore if you follow the “agent” pod approach.

from kubernetes import client, config

# Load credentials from the pod's mounted service account (in-cluster only)
config.load_incluster_config()
v1 = client.AppsV1Api()
# List all ReplicaSets in the "example" namespace
ret = v1.list_namespaced_replica_set("example")

Further thoughts

In the Kubernetes world, this starts to feel like a “too many chefs in the kitchen” scenario. There is a certain risk in running this job because it may accidentally delete something. If you are using Spinnaker for blue/green deployments and have run into similar issues, I’d be interested to learn how you solve this problem.

(Thanks Vaibhav Mallya for proofreading the post)

Written by Lijie Zhou
SRE@Gusto. Formerly @Facebook @McAfee