katalog-sync: Reliable Integration of Consul and Kubernetes
Why use consul with Kubernetes (k8s)?
Consul is a well-known and widely used service discovery mechanism. Here at Wish, we standardized on consul as our service discovery system quite some time ago. Although k8s has a built-in service discovery mechanism, we want to keep using consul as our primary one. This way, services in k8s are discoverable outside of k8s and aren't tied to a specific cluster, and when a service needs to ramp in and out of k8s it can do so gradually.
The previous solution: sidecar consul-agent
When we launched k8s we decided to add a consul-agent sidecar to each pod. In our k8s environment, each pod has a routable IP in our VPC and as such this functions pretty well. However, after having used this for several months we have noticed a few pain points:
- Configuration: Each service/namespace in k8s needs to have the same consul configuration (client configuration, encryption key, etc.). We largely dealt with this through jsonnet templating, but even so, we ended up having the encryption key in each namespace and a fair amount of configuration duplicated across services.
- Complexity: Using this sidecar approach means that we now have a full-blown consul node for each pod in k8s. For the initial migration to k8s, things were generally moved over 1:1 (EC2 instance → pod), but as we continued to refine our sizing we ended up with significantly more pods than we had EC2 instances before. This also meant we effectively had N nodes participating in the consul memberlist running on the same instance or hardware.
- Failure modes: With a consul-agent sidecar on each pod we can run into thundering herd issues in consul failure modes due to the large number of nodes in the cluster.
- Noisy alerts: Consul’s memberlist expects members to be more-or-less long-lived. Deregistration of a node in the memberlist takes (by default) 72h which means that a node will still be part of the memberlist even after leaving intentionally. In practice, this is a nuisance as the node still shows up in consul’s service discovery until it drops off (e.g. Prometheus’ consul discovery).
- Consul checks vs k8s checks: Probably the most painful issue we’ve run into is configuring consul checks. K8s itself has concepts of liveness and readiness which it uses to manage the pods themselves. In addition to this k8s readiness, we also needed to configure consul so it would add/remove the service from rotation based on the pod’s readiness. Operationally this is painful to keep in sync, as consul and k8s offer different mechanisms for health checks.
Looking for alternatives: consul-k8s
At the end of last year HashiCorp announced consul-k8s as a mechanism to sync services to/from k8s and consul. We were excited to switch to a more k8s-native mechanism for syncing state to consul, and quickly started prototyping with it. Going into it we listed our requirements as:
- Configuration through k8s annotations
- Readiness sync
- High availability with no single point of failure (SPOF)
The good
Consul-k8s offers mechanisms to sync both from k8s → consul and from consul → k8s. We don’t have a need for consul → k8s, so we’ll focus on the k8s → consul sync, which syncs services from k8s into consul. This means that you can configure syncing at the service level in k8s through annotations. For example (borrowed from here):
kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
    "consul.hashicorp.com/service-name": my-consul-service
This configuration-through-annotation both dramatically simplifies templating and is significantly easier to understand.
The bad
Unsurprisingly (since we are writing this post) we ran into some issues while testing out consul-k8s. Initially we hit problems with multi-cluster support, but those were resolved relatively quickly. After getting a proof of concept working with multi-cluster support, we started some failure-mode testing. During this testing, we found 2 major issues:
- Consul-k8s has no liveness/readiness checks for its sync process.
- Consul-k8s has no mechanism to mitigate outage impact.
In addition to those issues, we found a requirement we didn’t know we had! With the sidecar consul-agent approach, if the consul-agent was unable to join the cluster for some reason, the pod would fail and k8s would halt the deployment. Consul-k8s, however, is a single process per cluster that asynchronously syncs state from k8s to consul:
- Kubelet starts container on Node
- Kubelet updates k8s API
- Consul-k8s notices change in k8s-api
- Consul-k8s pushes change to consul
This means the ability of consul-k8s to sync the k8s state to consul is completely independent of the k8s pod deployments. We could easily create scenarios where an entire service completes a rolling update (with new pod IPs, etc.) without that state ever being synced to consul. In other words, service discovery could end up with 0 correct entries for the service, leaving clients unable to connect to it! We realized that this was a deal-breaker for us and (unfortunately) the single-syncer architecture does not offer a solution to this issue.
Finally, a solution: katalog-sync
At this point it was clear that we’d have to implement our own mechanism for syncing things from k8s → consul’s catalog, and we picked up one additional requirement: the ability to stop deploys in k8s from completing if we are unable to sync to consul.
Design
The fundamental flaw in consul-k8s, for us, was the single-syncer design: one syncer for the whole cluster, communicating entirely through the k8s API server. This design brought up 2 major concerns:
- Failure duration: in the event of a failure, we want to ensure we can recover quickly.
- Failure impact: in the event of a failure, we want to keep its scope as small as possible.
Consul-k8s relied on its k8s deployment to handle the first concern, but had no answer for the second, as the consul-k8s process itself is a SPOF.
So to address these reliability issues we decided to take a different approach: instead of building a cluster-wide syncer, we built a node-local syncer. In our environment (as with most consul + k8s environments) we already run a consul-agent as a daemonset on each k8s node for use by node-local Prometheus exporters (e.g. node_exporter). The kubelet itself has an API that lists the pods running locally, which means we have both the k8s pod state and a consul-agent local to each node (a rough sketch of the kubelet query follows the flow below).
- Kubelet starts container on Node
- (optional) katalog-sync-sidecar calls the katalog-sync-daemonset and waits until registration with consul is complete
- Daemonset syncs changes from kubelet through the local kubelet API
- Daemonset syncs changes to consul
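To make the node-local flow concrete, here is a minimal sketch (in Go) of the first half of the daemonset’s job: asking the kubelet for the pods running on this node. It assumes the kubelet’s read-only API is reachable on localhost:10255, which varies by cluster configuration, and it is not meant to mirror katalog-sync’s actual implementation.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"

	corev1 "k8s.io/api/core/v1"
)

// localPods fetches the node-local pod list from the kubelet.
func localPods() ([]corev1.Pod, error) {
	// Read-only kubelet port; an assumption, and often disabled in hardened clusters.
	resp, err := http.Get("http://localhost:10255/pods")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var podList corev1.PodList
	if err := json.NewDecoder(resp.Body).Decode(&podList); err != nil {
		return nil, err
	}
	return podList.Items, nil
}

func main() {
	pods, err := localPods()
	if err != nil {
		panic(err)
	}
	for _, p := range pods {
		// A syncer would filter on annotations here and push each pod to the local consul-agent.
		fmt.Println(p.Namespace, p.Name, p.Status.Phase)
	}
}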
In this setup, katalog-sync would pull the pods locally from the kubelet and sync their state (configured through annotations) to the local consul-agent’s service API. An example annotation would look like:
kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
    katalog-sync.wish.com/service-names: my-service
Using this annotation, katalog-sync then syncs the “ready” state from k8s to the status of the TTL check in consul.
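As a rough illustration of what that sync looks like against the node-local agent, here is a sketch using the official github.com/hashicorp/consul/api client: register the pod as a service with a TTL check, then mark the check passing or critical based on the pod’s k8s readiness. The service/check ID scheme and TTL value are illustrative assumptions, not katalog-sync’s actual behavior.

package main

import (
	"github.com/hashicorp/consul/api"
)

// syncPod registers a pod as a service on the node-local consul-agent and
// maps the pod's k8s readiness onto the consul TTL check status.
func syncPod(client *api.Client, serviceName, podIP string, port int, ready bool) error {
	agent := client.Agent()
	serviceID := serviceName + "-" + podIP // hypothetical ID scheme
	checkID := "service:" + serviceID

	// Register (or re-register) the service with a TTL check.
	if err := agent.ServiceRegister(&api.AgentServiceRegistration{
		ID:      serviceID,
		Name:    serviceName,
		Address: podIP,
		Port:    port,
		Check: &api.AgentServiceCheck{
			CheckID: checkID,
			TTL:     "30s",
		},
	}); err != nil {
		return err
	}

	// "Ready" in k8s => passing in consul; otherwise mark the check critical.
	status := api.HealthCritical
	if ready {
		status = api.HealthPassing
	}
	return agent.UpdateTTL(checkID, "synced from k8s readiness", status)
}

func main() {
	// Talks to the local agent (127.0.0.1:8500) by default.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	// Hypothetical values for illustration.
	if err := syncPod(client, "my-service", "10.2.3.4", 8080, true); err != nil {
		panic(err)
	}
}

Because the registration and TTL updates go through the node-local agent, a failure of the sync process on one node only affects the pods on that node.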
This solves the reliability concern, but it doesn’t yet account for our requirement of controlling pod deployment status. To control pod lifecycle we need to actually be in the pod itself, and to do this we introduce a sidecar. Katalog-sync-sidecar exposes a readiness endpoint which reports whether the service has synced to consul. On startup the sidecar passes its container name to the daemonset in an RPC call, and the daemonset then excludes the sidecar container from the overall service “readiness” check. This means that the pod itself won’t be considered “ready” until all other containers are ready and the sidecar has ensured that the sync to consul has completed.
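A minimal sketch of what such a sidecar readiness endpoint could look like is below; the RPC to the daemonset is elided and simulated by a flag, and the port and path are hypothetical rather than katalog-sync’s actual interface.

package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// synced is flipped to true once the daemonset confirms the consul registration.
var synced atomic.Bool

// readyHandler backs the container's k8s readinessProbe: the pod stays
// "not ready" (and the rollout stays paused) until this returns 200.
func readyHandler(w http.ResponseWriter, r *http.Request) {
	if synced.Load() {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("synced to consul\n"))
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
	w.Write([]byte("waiting for consul sync\n"))
}

func main() {
	go func() {
		// In the real sidecar this is where the RPC to the daemonset would block
		// until registration completes; here it is simulated by flipping the flag.
		synced.Store(true)
	}()

	http.HandleFunc("/ready", readyHandler)
	log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical readiness-probe port
}

The sidecar container’s readinessProbe would then point at this endpoint, tying the k8s rollout to the consul sync.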
Rollout bumps
After the code was written and the daemonset was deployed, we migrated a few pilot services over to this new system to do failure testing etc. against real systems/services. After migrating the first service in stage, we found that not all the pods in k8s were in consul, even though the sidecars were marked as ready. After digging around we confirmed that the pod was healthy and the service was in the agent’s services, but the agent was spewing errors in the logs that looked like:
* Failed to join <IP>: Member '<US>' has conflicting node ID 'be688838-ca86-86e5-c906-89bf2ab585ce' with member '<OTHER_MEMBER>'
This log message is actually a known issue in upstream consul, caused by an upgrade of the consul-agent. For our purposes, the cause is not as important as the impact: this failure of the local consul-agent showed us that registering a service through the local agent’s services API doesn’t guarantee the service gets synced to the rest of the cluster.
Thankfully, in our design this is an easy fix: we added a check on the sidecar RPC path that ensures the service has synced to the consul cluster. This way, not only is the service registered with the local consul-agent, but it is also available through the cluster’s catalog API.
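For illustration, a sketch of that extra check using the consul catalog API might look like the following; the exact check katalog-sync performs may differ, and the service/instance names are hypothetical.

package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// syncedToCatalog reports whether the given service instance is visible in the
// cluster's catalog API, i.e. the local agent has successfully pushed it out,
// not merely accepted it into its local services list.
func syncedToCatalog(client *api.Client, serviceName, serviceID string) (bool, error) {
	entries, _, err := client.Catalog().Service(serviceName, "", nil)
	if err != nil {
		return false, err
	}
	for _, e := range entries {
		if e.ServiceID == serviceID {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	ok, err := syncedToCatalog(client, "my-service", "my-service-10.2.3.4")
	if err != nil {
		panic(err)
	}
	fmt.Println("visible in catalog:", ok)
}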
Conclusion
With the completion of katalog-sync we can take a step back and check how our initial pain points have been dealt with:
- Configuration: We significantly simplified our configuration through annotations on the pod. No need for consul secrets/configuration!
- Complexity: We have a lightweight sidecar process and a single consul-agent per k8s node.
- Failure modes: We have a node-local failure domain for consul syncing with (optional) ties into pod life-cycle to ensure sync on pod startup.
- Noisy Alerts: Now when a pod is removed, it’s simply a consul agent service removal. No more dangling node entries due to pod changes.
- Consul checks vs k8s checks: Checks are now defined in a single place (k8s) and synced instead of having duplicate definitions.
At this point, we have migrated our k8s infrastructure to use katalog-sync and it has already been a significant improvement over our previous sidecar consul-agent approach. If you are also using k8s and consul we hope that it will be helpful to you too. Check out: https://github.com/wish/katalog-sync.
Acknowledgements
Projects (and articles) like this require many people to accomplish, special thanks to Tomas Virgl, Thomas Wiedenbein, Micah Croff, Yvonne Nguyen, Brice Lin, Raine Medeiros, and Kevin Long.