Active/Passive High Availability

Overview

Initially it was assumed that there would only be a single kuryr-controller instance in a Kuryr-Kubernetes deployment. While this assumption simplified a lot of the controller code, it is obviously not ideal. Having redundant controllers helps achieve higher availability and scalability of the deployment.

Now that it is possible to run Kuryr in Pods on the Kubernetes cluster, HA is much easier to implement. The purpose of this document is to explain how it will work in practice.

Proposed Solution

There are two types of HA - Active/Passive and Active/Active. In this document we'll focus on the former. In A/P mode one of the instances is the leader (doing all the exclusive tasks), while the other instances wait in standby mode, ready to take over the leader role if the leader dies. As you can see, a leader election mechanism is required to make this work.

Leader election

The idea here is to use a leader election mechanism based on Kubernetes Endpoints, which is neatly explained on the Kubernetes blog. Election is based on an Endpoints resource that holds an annotation describing the current leader and its leadership lease time. If the leader dies, other instances of the service are free to take over the record. The Kubernetes API's update mechanisms (optimistic concurrency based on resourceVersion) prevent race conditions.

This can be implemented by adding another leader-elector container to each of kuryr-controller pods:

- image: gcr.io/google_containers/leader-elector:0.5
  name: leader-elector
  args:
  - "--election=kuryr-controller"
  - "--http=0.0.0.0:${KURYR_CONTROLLER_HA_PORT:-16401}"
  - "--election-namespace=kube-system"
  - "--ttl=5s"
  ports:
  - containerPort: ${KURYR_CONTROLLER_HA_PORT:-16401}
    protocol: TCP

This adds a new container to the pod. This container will perform the leader election and expose a simple JSON API on port 16401 by default. This API will be available to the kuryr-controller container.
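
For illustration, here is a minimal Python sketch of how the kuryr-controller container could query that API; the URL and the payload shape ({"name": "<leader-pod-name>"}) are assumptions based on the leader-elector example:

# Sketch of querying the leader-elector sidecar from kuryr-controller.
# The payload shape ({"name": "<leader-pod-name>"}) is an assumption
# based on the leader-elector example.
import json
import urllib.request

LEADER_ELECTOR_URL = 'http://localhost:16401'


def get_current_leader():
    """Return the name of the pod currently holding the leadership."""
    with urllib.request.urlopen(LEADER_ELECTOR_URL, timeout=5) as resp:
        return json.loads(resp.read().decode())['name']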

Kuryr Controller Implementation

The main issue with having multiple controllers is task division. All of the controllers are watching the same endpoints and getting the same notifications, but those notifications cannot be processed by multiple controllers at once, because we would end up with a huge race condition, where each controller creates Neutron resources but only one succeeds in putting the annotation on the Kubernetes resource it is processing.

This is obviously unacceptable, so as a first step we're implementing A/P HA, where only the leader works on the resources and the other instances wait on standby. This will be implemented by periodically calling the leader-elector API to check the current leader. On a leader change:

  • The pod losing the leadership will stop its Watcher. Please note that it will be stopped gracefully, so all ongoing operations will be completed.

  • The pod gaining the leadership will start its Watcher. Please note that it will get notified about all previously created Kubernetes resources, but will ignore them, as they already have the annotations.

  • Pods not affected by the leadership change will remain in standby mode with their Watchers stopped.

Please note that this means that in HA mode the Watcher will not get started on controller startup, but only when the periodic task notices that its instance is the leader; a sketch of that task follows.
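
A minimal Python sketch of that periodic task, assuming hypothetical Watcher.start()/stop()/is_running() methods and reusing the get_current_leader() helper from the earlier sketch:

# Sketch of the periodic leadership check. Watcher.start()/stop()/
# is_running() are hypothetical methods; get_current_leader() is the
# helper from the earlier sketch; the env var is an assumption
# (it could be set via the Kubernetes Downward API).
import os
import threading

MY_POD_NAME = os.environ['KURYR_CONTROLLER_POD_NAME']
CHECK_INTERVAL = 3  # seconds; keep well below the election --ttl


def leadership_check(watcher):
    is_leader = get_current_leader() == MY_POD_NAME
    if is_leader and not watcher.is_running():
        # We gained leadership - start processing events.
        watcher.start()
    elif not is_leader and watcher.is_running():
        # We lost leadership - stop gracefully and stand by.
        watcher.stop()
    # Re-arm the periodic check.
    threading.Timer(CHECK_INTERVAL, leadership_check, [watcher]).start()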

Issues

There are certain issues related to orphaned OpenStack resources that we may hit. These can happen in two cases:

  • A controller instance dies instantly during request processing. Some OpenStack resources were already created, but information about them was not yet annotated onto the Kubernetes resource. That information is therefore lost and we end up with orphaned OpenStack resources. The new leader will process the Kubernetes resource by creating the resources again.

  • Some Kubernetes resources are deleted during a leader transition (the short period after a leader dies, but before its lease expires and the periodic task on the other controllers notices that; this shouldn't exceed 10 s). The new leader will not get the notifications about those deletions, so they will go unnoticed.

Both of these issues can be tackled by a garbage-collector mechanism that will periodically look over Kubernetes resources and delete OpenStack resources that have no representation in annotations, e.g. along the lines of the sketch below.
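
A rough Python sketch of such a garbage collector; the neutron_client and k8s_client helpers, the device_owner value and the annotation key are all illustrative assumptions:

# Rough sketch of the proposed garbage collector. neutron_client and
# k8s_client are hypothetical helpers; the device_owner value and the
# annotation key are illustrative assumptions.
import json


def collect_garbage(neutron_client, k8s_client):
    # Ports that Kuryr created on behalf of pods.
    kuryr_ports = neutron_client.list_ports(
        device_owner='compute:kuryr')['ports']

    # Port IDs still referenced by pod annotations.
    annotated_ids = set()
    for pod in k8s_client.list_pods(all_namespaces=True):
        vif = pod.annotations.get('openstack.org/kuryr-vif')
        if vif:
            annotated_ids.add(json.loads(vif)['id'])

    # Anything Kuryr-owned but unreferenced is orphaned - delete it.
    for port in kuryr_ports:
        if port['id'] not in annotated_ids:
            neutron_client.delete_port(port['id'])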

The latter issue can also be tackled by saving the last seen resourceVersion of the watched resources when stopping the Watcher and restarting the watch from that point, as in the sketch below.
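
A minimal Python sketch of resuming the watch using the raw Kubernetes watch API; the apiserver base URL is an assumption:

# Sketch of resuming a watch from the last seen resourceVersion using
# the raw Kubernetes watch API; K8S_API is an assumed apiserver URL.
import json
import urllib.request

K8S_API = 'http://localhost:8080/api/v1'  # assumption for the example


def watch_pods(last_resource_version):
    """Replay events missed since last_resource_version, then follow."""
    url = ('%s/pods?watch=true&resourceVersion=%s'
           % (K8S_API, last_resource_version))
    with urllib.request.urlopen(url) as stream:
        for line in stream:
            event = json.loads(line)
            # DELETED events that happened while the Watcher was down
            # are replayed here instead of being lost. The caller should
            # persist the newest resourceVersion to resume from later.
            yield event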

Future enhancements

It would be useful to implement the garbage collector and the resourceVersion-based protection mechanism described in the section above.

Besides that, to further improve scalability, we should work on an Active/Active HA model, where work is divided evenly between all of the kuryr-controller instances. This can be achieved e.g. by using a consistent hash ring to decide which instance will process which resource.
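
For illustration, a minimal consistent hash ring sketch in Python; the class and instance names are made up for the example:

# Minimal consistent hash ring sketch; names are made up for the
# example. Each controller would process only resources that the ring
# maps to its own instance name.
import bisect
import hashlib


class HashRing(object):

    def __init__(self, instances, replicas=100):
        # Place several virtual nodes per instance for an even spread.
        self._ring = sorted(
            (self._hash('%s-%d' % (instance, i)), instance)
            for instance in instances for i in range(replicas))
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_instance(self, resource_uid):
        """Return the instance responsible for the given resource."""
        index = bisect.bisect(self._keys, self._hash(resource_uid))
        return self._ring[index % len(self._ring)][1]


# Usage: process an event only if it hashes to this instance.
# ring = HashRing(['kuryr-controller-0', 'kuryr-controller-1'])
# if ring.get_instance(pod_uid) == MY_POD_NAME:
#     process(event)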

Potentially this can be extended with support for non-containerized deployments by using Tooz backed by some other tool providing leader election, like Consul or ZooKeeper.
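
A rough sketch of what leader election through Tooz could look like (ZooKeeper backend and all names assumed; this is a possible future direction, not an implemented feature):

# Rough sketch of leader election through Tooz (ZooKeeper backend and
# all names assumed; a possible future direction, not an implemented
# feature).
import time

from tooz import coordination

GROUP = b'kuryr_controller'


def on_elected(event):
    # This instance became the leader; start the Watcher here.
    pass


coordinator = coordination.get_coordinator(
    'zookeeper://127.0.0.1:2181', b'controller-1')
coordinator.start()
try:
    coordinator.create_group(GROUP).get()
except coordination.GroupAlreadyExist:
    pass
coordinator.join_group(GROUP).get()
coordinator.watch_elected_as_leader(GROUP, on_elected)

while True:
    coordinator.heartbeat()  # keep our membership alive
    coordinator.run_watchers()  # fire election callbacks
    time.sleep(1)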