Site Reliability Engineering for Kubernetes


Reliability Deep Dive

As an SRE, I have a framework I use when doing a Reliability Deep Dive 🌊:

  1. Look back and analyze failures
  2. Determine goals and key metrics
  3. Create a reliability strategy
  4. Put your reliability strategy to the test

Look Back & Analyze Failures

Let’s get started by looking back at common failure modes when running Kubernetes in Production.

Common Failure Modes for Kubernetes in Production

Based on the postmortems collected and shared at k8s.af, we’re able to identify the most common failure modes currently impacting Kubernetes in Production. Incidents were most commonly caused by CPU-related issues (25%) or clusters becoming unavailable due to a range of issues (25%). The remaining 50% of incidents were related to Networking (DNS, Latency and Packet Loss), Resources (Disk or Memory) or Security.

Kubernetes Failure Mode: High CPU

Several companies have reported high CPU spikes causing problems for their systems and users. Target engineer Dan Woods shared how their Kubernetes clusters were impacted by a high CPU incident in a Medium post titled “On Infrastructure at Scale: A Cascading Failure of Distributed Systems”.

Kubernetes Failure Mode: CPU Throttling

In July 2019, Henning Jacobs (Zalando) gave a talk at ContainerDays in Hamburg titled “Kubernetes Failure Stories, or: How to Crash Your Cluster”. In this talk, Henning explained how CPU throttling impacted the reliability of the cluster.

Kubernetes Failure Mode: Autoscaling via CPU

In 2017, Nordstrom shared “101 Ways to Crash Your Cluster”, based on their experience running Kubernetes at scale. This included examples related to autoscaling.

Analysis Of Reported Kubernetes Outages by Cloud Provider

There have been 45 reported Kubernetes Production incidents collected and shared at k8s.af. When we analyze the cloud providers most commonly mentioned in these postmortems, we can see that 65.8% of the incidents occurred on AWS. These were primarily hand-rolled Kubernetes clusters running on EC2 (not EKS, the managed Kubernetes service provided by AWS); there is only one reported AWS EKS incident. GKE users experienced 23.7% of outages, followed by Azure (5.3%). The remaining 5.3% of outages occurred on on-prem Kubernetes clusters.

Common Failure Modes to Prepare For by Cloud Provider

Kubernetes on AWS: Failure Modes

If you are using AWS, I recommend you focus on CPU as your primary failure mode during your hardening activities. I then recommend you investigate networking-related failures (primarily blackhole, latency, and DNS).

Kubernetes on GKE: Failure Modes

If you are using GKE, I recommend you focus on Blackhole as your primary failure mode during your hardening activities. I then recommend you investigate Shutdown, Latency and DNS.

Kubernetes on Azure: Failure Modes

If you are using Azure, I recommend you focus on Shutdown as your primary failure mode during your hardening activities.

Kubernetes On Prem: Failure Modes

If you run your own on-prem hardware and operate your own datacenters, I recommend you focus on CPU and DNS.

Site Reliability Engineering for Kubernetes

Now that we’ve explored common failure modes for Kubernetes across cloud providers, it is time to take what we have learned and use it to practice SRE for Kubernetes.

Determine Goals and Key Metrics

There are a number of important metrics to keep an eye on when focusing on the reliability of your Kubernetes cluster. Here is a selection of Kubernetes metrics to report on and track in real time (a sketch of how several of these can be tracked follows the list):

  • Number of Kubernetes clusters
  • Number of nodes
  • Number of nodes by cluster
  • Number of pods
  • Number of pods by node
  • Nodes by uptime (min to max)
  • Pods by uptime (min to max)
  • Number of applications/services
  • Number of applications/services by cluster
  • Resource utilization by node (e.g. CPU)
  • Disk writes per node
  • Disk reads per node
  • Network errors per node
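
If you run Prometheus with kube-state-metrics and node-exporter (an assumption; substitute the equivalent queries in your own monitoring stack), several of these can be tracked in real time with recording rules. A minimal sketch:

```yaml
# Sketch of Prometheus recording rules for a few of the metrics above.
# Assumes kube-state-metrics and node-exporter are installed; rule names
# are illustrative.
groups:
  - name: kubernetes-reliability-overview
    rules:
      # Number of nodes
      - record: cluster:nodes:count
        expr: count(kube_node_info)
      # Number of pods
      - record: cluster:pods:count
        expr: count(kube_pod_info)
      # Number of pods by node
      - record: node:pods:count
        expr: count(kube_pod_info) by (node)
      # CPU utilization per node
      - record: instance:cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Network errors per node
      - record: instance:network_errors:rate5m
        expr: >
          sum by (instance) (rate(node_network_receive_errs_total[5m])
          + rate(node_network_transmit_errs_total[5m]))
```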

Create a Reliability Plan

Now we are ready to create our reliability strategy for Kubernetes at our company. I highly recommend creating a custom strategy based on your company and the goals you are working towards as an organization. As a reminder, here is the framework we are working through:

  1. Look back and analyze failures
  2. Determine goals and key metrics
  3. Create a reliability strategy
  4. Put your reliability strategy to the test

Fictional Reliability Plan: Internet Banking as a Service

Let’s create a fictional reliability plan for a bank that provides internet banking as a service:

  1. Scale — to ensure we can reliably scale and meet our customers’ needs we will focus on scalability. As we onboard new customers, our existing customers and new customers should have a smooth experience.
  2. Availability — to ensure customers are always able to access their internet banking we will focus on uptime as a core service offering and will fix things quickly if we get them wrong. We will ensure customers always have access to their money.
  3. Correctness — to ensure internet banking transactions are accurate and reflected to customers in real time.

Put Your Kubernetes Reliability Strategy To The Test

Now that we’ve looked back at failures, determined goals and created a reliability plan, we are ready to put our reliability plan to the test!

  1. 💗 Scale — CPU
  2. ✅ Availability — Blackhole, DNS
  3. 👌 Correctness — Shutdown, Latency and Packet Loss

Harden K8s clusters: Scale (CPU)

CPU management is important for production workloads, and it can easily cause issues if not handled appropriately. There are a few important questions you need to ask yourself (a manifest sketch that makes these decisions explicit follows the list):

  • How much CPU will I need per instance?
  • Will I use Kubernetes pod priorities to manage resources?
  • How difficult is it for me to upgrade my instances and increase CPU?
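
One way to make these decisions explicit is in the workload manifests themselves. The sketch below sets CPU requests and limits and assigns a PriorityClass; the payments-api workload and business-critical class are illustrative names, not part of any real deployment.

```yaml
# Sketch: explicit CPU requests/limits plus a PriorityClass.
# "payments-api" and "business-critical" are illustrative names.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 100000
globalDefault: false
description: "Schedule (and keep) these pods ahead of best-effort workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: business-critical
      containers:
        - name: api
          image: payments-api:1.0.0   # illustrative image
          resources:
            requests:
              cpu: "500m"      # what the scheduler reserves per pod
              memory: "256Mi"
            limits:
              cpu: "1"         # hard cap; exceeding it causes CFS throttling
              memory: "512Mi"
```

Requests drive scheduling, while limits control throttling, which is exactly the behaviour the next two exercises put under stress.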

Hardening Exercise #1: Kubernetes — High CPU

We will be using the Gremlin Scenario “Kubernetes — Scale — High CPU” for this hardening exercise. This is a scaling scenario for Kubernetes. It will trigger high CPU. We expect that this should not degrade functionality for the user and all operations should perform as expected.

Hardening Exercise #2: Kubernetes — Throttle CPU

We will be using the Gremlin Scenario “Kubernetes — Scale — Throttle CPU” for this hardening exercise. This is a scaling scenario for Kubernetes that increases CPU as a chain of attacks, and it is used to ensure that there are no issues related to CPU throttling. As mentioned above, in July 2019 Henning Jacobs (Zalando) gave a talk at ContainerDays in Hamburg, “Kubernetes Failure Stories, or: How to Crash Your Cluster”, in which he explained how CPU throttling impacted the reliability of the cluster.

Hardening Exercise #3: Kubernetes — Autoscaling via CPU

We will be using the Gremlin Scenario “Kubernetes — Scale — Autoscaling via CPU” for this hardening exercise. This is a scaling scenario for Kubernetes. It will cause AWS autoscaling to kick in as CPU increases. We expect that this should not degrade functionality for the user and all operations should perform as expected.
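
The pod-level half of this behaviour can be expressed as a HorizontalPodAutoscaler; node-level scaling (for example, the cluster autoscaler adding EC2 instances) then follows once new pods no longer fit. A minimal sketch, reusing the illustrative payments-api Deployment:

```yaml
# Sketch: scale payments-api on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU passes 70%
```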

Harden K8s clusters: Availability (Blackhole and DNS)

Blackhole is a technique you can use to make nodes and pods unavailable more safely; it is a less destructive action than shutdown. There are a few important questions you need to ask yourself (a manifest sketch that helps answer them follows the list):

  • Can my Kubernetes cluster gracefully handle a node becoming unavailable?
  • Can my Kubernetes cluster gracefully handle a pod becoming unavailable?
  • How does my Kubernetes cluster handle a DNS outage?
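
One concrete answer to the second question is a PodDisruptionBudget, which tells Kubernetes how many replicas must remain available while pods are evicted or nodes are drained. A minimal sketch for the illustrative payments-api workload:

```yaml
# Sketch: keep at least two payments-api pods running during disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-api
```

Note that PodDisruptionBudgets only guard voluntary disruptions such as drains; surviving an abrupt blackhole also depends on running enough replicas spread across nodes and zones, which the next exercises probe.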

Hardening Exercise #4: Kubernetes — Blackhole a Kubernetes node

We will be using the Gremlin Scenario “Kubernetes — Availability — Blackhole a Kubernetes node” for this hardening exercise. This is an availability scenario for Kubernetes. This scenario will make one node in your Kubernetes cluster unavailable. We expect that the application will still be able to serve user traffic and operate as expected.

Hardening Exercise #5: Kubernetes — Blackhole a region

We will be using the Gremlin Scenario “Kubernetes — Availability — Blackhole a region” for this hardening exercise. This scenario will make one region unavailable. We expect that the application will be able to route user traffic correctly. The application should operate as expected.
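
Surviving the loss of a zone or region is largely a placement problem. A hedged sketch using topology spread constraints so replicas of the illustrative payments-api land in different availability zones (the same pattern applies with a region label in a multi-region topology):

```yaml
# Sketch: spread replicas across zones so one zone (or region) going dark
# does not take out every replica.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # use a region label for multi-region
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payments-api
      containers:
        - name: api
          image: payments-api:1.0.0   # illustrative image
```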

Hardening Exercise #6: Kubernetes — DNS outage

We will be using the Gremlin Scenario “Kubernetes — Availability — DNS outage” for this hardening exercise. This is an availability scenario for Kubernetes. This scenario will cause a DNS outage. We expect that the application will still be able to serve user traffic and operate as expected due to DNS failover. If DNS failover is not set up correctly, we expect an outage to occur.
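
On the application side, resolver settings are a common place where DNS failures hurt more than they need to. The sketch below tightens resolver timeouts and retries via dnsConfig so lookups fail fast and can be retried or failed over instead of hanging request threads; running NodeLocal DNSCache is another common mitigation.

```yaml
# Sketch: tune the pod's resolv.conf so DNS failures are detected quickly.
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  containers:
    - name: api
      image: payments-api:1.0.0   # illustrative image
  dnsConfig:
    options:
      - name: timeout
        value: "2"    # seconds before trying the next nameserver
      - name: attempts
        value: "3"    # resolver retries before giving up
      - name: ndots
        value: "2"    # fewer search-domain expansions per lookup
```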

Harden K8s clusters: Correctness (Shutdown, Latency, and Packet Loss)

Data integrity and correctness are always core customer concerns. Data issues are the fastest way to lose customer trust, and customers along with it. There are a few important questions you need to ask yourself (a manifest sketch addressing the first of them follows the list):

  • If a node is shutdown do I retain data integrity and correctness?
  • How does my Kubernetes cluster handle latency?
  • How does my Kubernetes cluster handle packet loss?
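
For the first question, the usual answer is to keep stateful data on PersistentVolumes rather than on the node itself, so a node shutdown does not take the data with it. A minimal sketch of a StatefulSet with a volume claim template (the ledger workload and fast-ssd storage class are illustrative names):

```yaml
# Sketch: stateful data behind PersistentVolumeClaims, so it outlives any node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ledger
spec:
  serviceName: ledger
  replicas: 3
  selector:
    matchLabels:
      app: ledger
  template:
    metadata:
      labels:
        app: ledger
    spec:
      containers:
        - name: ledger
          image: ledger:1.0.0   # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/ledger
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # illustrative storage class
        resources:
          requests:
            storage: 10Gi
```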

Hardening Exercise #7: Kubernetes — Shutdown a node

We will be using the Gremlin Scenario “Kubernetes — Correctness — Shutdown a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will shut down a node. We expect that the application will not lose data and should operate as expected.
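
A graceful termination path helps here: it gives the process time to flush in-flight writes before the kubelet sends SIGKILL. A hedged sketch, where the preStop command is a placeholder for whatever flush or drain step your application provides:

```yaml
# Sketch: allow time to flush in-flight writes before the pod is killed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ledger
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ledger
  template:
    metadata:
      labels:
        app: ledger
    spec:
      terminationGracePeriodSeconds: 60   # default is 30 seconds
      containers:
        - name: ledger
          image: ledger:1.0.0   # illustrative image
          lifecycle:
            preStop:
              exec:
                # Placeholder: replace with your application's flush/drain step.
                command: ["/bin/sh", "-c", "flush-and-drain || true"]
```

An abrupt node shutdown may not give the kubelet a chance to run this path at all, which is exactly why the exercise is worth running.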

Hardening Exercise #8: Kubernetes — Shutdown a service

We will be using the Gremlin Scenario “Kubernetes — Correctness — Shutdown a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will shut down one service. We expect that the application will be able to route user traffic correctly and that shutting down one service should not have a knock-on impact on other services. The application should operate as expected.

Hardening Exercise #9: Kubernetes — Inject latency to a node

We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject latency to a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject latency into one node. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.
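
Explicit probe timeouts help ensure that a pod on a latency-impaired node is marked not ready and removed from Service endpoints, rather than quietly serving slow responses. A minimal sketch on the illustrative payments-api pod (the /healthz endpoint and port are assumptions):

```yaml
# Sketch: explicit probe timeouts so sustained latency takes the pod out
# of rotation (readiness) and eventually restarts it (liveness).
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  containers:
    - name: api
      image: payments-api:1.0.0   # illustrative image
      readinessProbe:
        httpGet:
          path: /healthz          # illustrative health endpoint
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2         # responses slower than 2s count as failures
        failureThreshold: 3       # ~15s of sustained latency removes the pod
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
```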

Hardening Exercise #10: Kubernetes — Inject latency to a service

We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject latency to a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject latency to one service. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.

Hardening Exercise #11: Kubernetes — Inject packet loss to a node

We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject packet loss to a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject packet loss to one node. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.

Hardening Exercise #12: Kubernetes — Inject packet loss to a service

We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject packet loss to a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject packet loss to one service. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.

Conclusion

We’ve walked through how to apply Site Reliability Engineering practices to your Kubernetes clusters. We started by looking back and analyzing failures, then determined goals and key metrics, created a reliability plan, and put that plan to the test using Gremlin Scenarios.
