Site Reliability Engineering for Kubernetes

Over the last 4.5 years, Kubernetes has become dramatically more usable, and it is now easier than ever to get started. Cloud providers such as AWS now offer managed Kubernetes products that create and manage your clusters for you. This is a huge change compared to rolling your own Kubernetes cluster.

One of the most interesting shifts I have seen in our industry over the last 2 years is that more and more companies are now running their production workloads on Kubernetes. This is where things start to get interesting for SREs. Now we can learn from each other, discuss common reliability issues, and share the reliability principles that harden Kubernetes clusters.

Thanks to https://twitter.com/MindsEyeCCF for this illustration

Reliability Deep Dive

  1. Look back and analyze failures
  2. Determine goals and key metrics
  3. Create a reliability strategy
  4. Put your reliability strategy to the test

Once this framework is in action, I continuously monitor and report on progress.

Look Back & Analyze Failures

Common Failure Modes for Kubernetes in Production

CPU-related outages can be bucketed into three categories: high CPU, CPU throttling, and autoscaling via CPU.
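To tell high CPU apart from CPU throttling, it can help to look at the CFS throttling counters that Linux exposes per cgroup in a `cpu.stat` file. The sketch below parses that text and computes the share of scheduling periods in which a container hit its CPU limit. The sample numbers and the 25% threshold are illustrative assumptions, not values from any real cluster.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: dict[str, int]) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical sample; on a real node this text would be read from the
# container's cgroup cpu.stat file.
SAMPLE = """nr_periods 1200
nr_throttled 420
throttled_time 86000000000
"""

ratio = throttled_ratio(parse_cpu_stat(SAMPLE))
# The 0.25 threshold is an assumed example, not a recommended value.
flag = "investigate" if ratio > 0.25 else "ok"
print(f"throttled in {ratio:.0%} of periods: {flag}")
```

A container can show a high throttled ratio while its node looks idle, which is why throttling deserves its own category next to plain high CPU.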

Kubernetes Failure Mode: High CPU

Below are key excerpts (illustrated by Emily Griffin):

Kubernetes Failure Mode: CPU Throttling

Kubernetes Failure Mode: Autoscaling via CPU

Next, let’s see if there is a difference in failure modes based on cloud providers. We do assume that we will see specific failure modes more commonly associated with certain cloud providers.

Analysis Of Reported Kubernetes Outages by Cloud Provider

Next, let’s explore the common failure modes by cloud provider. Since auto-scaling, CPU, and instance shutdown are managed differently across cloud providers, I expect to see different failures being more commonly experienced on specific cloud providers (e.g. CPU with Amazon AWS).

Common Failure Modes to Prepare For by Cloud Provider

Kubernetes on AWS: Failure Modes

Kubernetes on GKE: Failure Modes

Kubernetes on Azure: Failure Modes

Kubernetes On Prem: Failure Modes

Site Reliability Engineering for Kubernetes

SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.

We are now ready to set our goals and identify key metrics.

Determine Goals and Key Metrics

  • Number of Kubernetes clusters
  • Number of nodes
  • Number of nodes by cluster
  • Number of pods
  • Number of pods by node
  • Nodes by uptime (min to max)
  • Pods by uptime (min to max)
  • Number of applications/services
  • Number of applications/services by cluster
  • Resource utilization by node (e.g. CPU)
  • Disk writes per node
  • Disk reads per node
  • Network errors per node
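As a rough illustration of how a few of the metrics above could be derived, here is a minimal sketch that aggregates a pod inventory. The `Pod` record and the sample data are hypothetical; in practice the inventory would come from your monitoring system or the Kubernetes API. Note that this sketch only sees nodes that run at least one pod.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    node: str
    cluster: str
    uptime_hours: float


# Hypothetical inventory used only for illustration.
PODS = [
    Pod("web-1", "node-a", "prod-east", 310.0),
    Pod("web-2", "node-b", "prod-east", 2.5),
    Pod("api-1", "node-a", "prod-east", 96.0),
    Pod("api-2", "node-c", "prod-west", 0.4),
]

# Number of pods by node.
pods_by_node = Counter(p.node for p in PODS)

# Number of nodes by cluster (only nodes that host at least one pod).
nodes_by_cluster: dict[str, set[str]] = {}
for p in PODS:
    nodes_by_cluster.setdefault(p.cluster, set()).add(p.node)

# Pods by uptime, min to max. Very young pods can hint at restarts.
pods_by_uptime = sorted(PODS, key=lambda p: p.uptime_hours)

print("clusters:", len(nodes_by_cluster))
print("pods by node:", dict(pods_by_node))
print("youngest pod:", pods_by_uptime[0].name)
```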

Create a Reliability Plan

When we create our reliability plan, we need to start with our Reliability Values, which focus on who we are. Who is our company? Who do we serve? What are our beliefs and the beliefs of our customers? What do our customers value from us?

We then need to determine our Reliability Vision. This is where we ask ourselves “why are we in this business?” and “why do our customers need us?”. Next comes our Reliability Mission, where we ask “what do we do?” and “how can we change our world, industry, and community?”. Then we set our Strategic Goals by asking ourselves “what do we want to accomplish, and by when?”. Finally, we determine the tactics we will use to achieve our strategic goals by asking “how will we get there?”. We decide on our short-term goals (less than 1 year) and identify the projects, resources, and people needed to make it happen.

Fictional Reliability Plan: Internet Banking as a Service

Values🌟: Passion for customers — Putting ourselves in our customers’ shoes, finding and delivering the right solutions for them, and fixing things quickly if we get them wrong.

Vision👁️: We aim to win together by being bold and making good decisions for our customers, people, and communities.

Mission📜 : Be the world’s leading online bank, trusted by customers and loved for exceptional service.

Strategic Goals📅: In the next 5 years we want to become the most popular online bank by both total number of customers and customer satisfaction rating.

Tactics📝:

  1. Scale: to ensure we can reliably scale and meet our customers’ needs, we will focus on scalability. As we onboard new customers, both existing and new customers should have a smooth experience.
  2. Availability: to ensure customers are always able to access their internet banking, we will focus on uptime as a core service offering and will fix things quickly if we get them wrong. We will ensure customers always have access to their money.
  3. Correctness: to ensure internet banking transactions are accurate and delivered to customers in real time, we will focus on correctness.

I personally don’t believe you can do reliability work in a vacuum away from your customers or your company values, vision and mission.

Put Your Kubernetes Reliability Strategy To The Test

Based on what we have discovered, we need to be prepared for certain failure modes that could prevent our Reliability Plan from succeeding.

Based on our fictional example, we can align common failure modes with our reliability plan tactics as follows:

  1. 💗 Scale — CPU
  2. ✅ Availability — Blackhole, DNS
  3. 👌Correctness — Shutdown, Latency and Packet Loss

Next, we can focus on ensuring we harden our K8s clusters in a prioritized way based on what we have learned and the tactics of our company. In the next section, I will outline how to perform your hardening exercises.
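One way to make that prioritization explicit is to keep the mapping of tactics to failure modes as data and flatten it into an ordered backlog. This is only a sketch: the tactic names and failure modes come from the fictional plan above, while the `hardening_backlog` helper is a hypothetical name introduced for illustration.

```python
# Tactics in priority order, each with the failure modes it is exposed to.
# The content mirrors the fictional Internet Banking plan.
TACTICS = {
    "Scale": ["CPU"],
    "Availability": ["Blackhole", "DNS"],
    "Correctness": ["Shutdown", "Latency", "Packet Loss"],
}


def hardening_backlog(tactics: dict[str, list[str]]) -> list[tuple[int, str, str]]:
    """Flatten tactics into (priority, tactic, failure_mode) rows in plan order."""
    backlog = []
    for tactic, failure_modes in tactics.items():  # dicts keep insertion order
        for mode in failure_modes:
            backlog.append((len(backlog) + 1, tactic, mode))
    return backlog


backlog = hardening_backlog(TACTICS)
for priority, tactic, mode in backlog:
    print(f"{priority}. {tactic}: harden against {mode}")
```

If your company ranks its tactics differently, reordering the dictionary is enough to reorder the backlog.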

Harden K8s clusters: Scale (CPU)

  • How much CPU will I need per instance?
  • Will I use Kubernetes pod priorities to manage resources?
  • How difficult is it for me to upgrade my instances and increase CPU?
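To start answering the first question, you can compare the CPU your pods request with what a node can allocate. The sketch below converts Kubernetes CPU quantities (for example `250m` or `1.5`) to millicores and computes the remaining headroom. The node size and the pod requests are made-up example values.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('250m', '2', '0.5') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def headroom(allocatable: str, pod_requests: list[str]) -> int:
    """Millicores left on a node after summing the CPU requests of its pods."""
    requested = sum(cpu_millicores(r) for r in pod_requests)
    return cpu_millicores(allocatable) - requested


# Hypothetical node with 3920m allocatable CPU and four pods.
NODE_ALLOCATABLE = "3920m"
POD_REQUESTS = ["500m", "1", "250m", "1.5"]

remaining = headroom(NODE_ALLOCATABLE, POD_REQUESTS)
print(f"remaining CPU: {remaining}m")
```

A negative or very small result suggests the node has no room for a traffic spike, which is what the High CPU exercise below is meant to expose.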

Hardening Exercise #1: Kubernetes — High CPU

https://app.gremlin.com/scenarios/recommended/kubernetes-scale-high-cpu/hosts

Hardening Exercise #2: Kubernetes — Throttle CPU

https://app.gremlin.com/scenarios/recommended/kubernetes-scale-throttle-cpu/hosts

Hardening Exercise #3: Kubernetes — Autoscaling via CPU

AWS Autoscaling Docs: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html

https://app.gremlin.com/scenarios/recommended/kubernetes-scaling-autoscaling-via-cpu/
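Before running an autoscaling exercise, it helps to know how many replicas you expect to see. The sketch below follows the basic formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current * currentMetric / targetMetric), with a tolerance band around the target. It is a simplification under stated assumptions: the real controller also accounts for unready pods, missing metrics, and stabilization windows, and the replica bounds and utilization figures here are example values.

```python
import math


def desired_replicas(
    current: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Estimate the replica count an HPA would aim for from CPU utilization."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band, no scaling is triggered.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Example: 4 replicas averaging 90% CPU against a 50% target.
print(desired_replicas(4, 90, 50))
```

Comparing this estimate with what the cluster actually does during the exercise is a quick way to spot misconfigured targets or limits.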

Harden K8s clusters: Availability (Blackhole and DNS)

  • Can my Kubernetes cluster gracefully handle a node becoming unavailable?
  • Can my Kubernetes cluster gracefully handle a pod becoming unavailable?
  • How does my Kubernetes cluster handle a DNS outage?
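A quick paper exercise for the first question is to check which services would lose every replica if a single node disappeared. The sketch below does this over a placement map. The service names, node names, and the `services_lost_if_node_fails` helper are hypothetical and only illustrate the idea.

```python
def services_lost_if_node_fails(
    placement: dict[str, list[str]], failed_node: str
) -> list[str]:
    """Return services whose replicas all run on the failed node."""
    return sorted(
        service
        for service, nodes in placement.items()
        if nodes and all(node == failed_node for node in nodes)
    )


# Hypothetical placement: service -> node of each replica.
PLACEMENT = {
    "login": ["node-a", "node-b"],
    "payments": ["node-a", "node-a"],
    "statements": ["node-c"],
}

all_nodes = sorted({node for nodes in PLACEMENT.values() for node in nodes})
for node in all_nodes:
    lost = services_lost_if_node_fails(PLACEMENT, node)
    print(f"{node} down -> fully lost: {lost or 'none'}")
```

In this example, `payments` has two replicas but both sit on the same node, so it is no safer than the single-replica `statements` service.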

Hardening Exercise #4: Kubernetes — Blackhole a Kubernetes node

https://app.gremlin.com/scenarios/recommended/kubernetes-availability-blackhole-kubernetes-node/hosts

Hardening Exercise #5: Kubernetes — Blackhole a region

https://app.gremlin.com/scenarios/recommended/kubernetes-availability-blackhole-a-region

Hardening Exercise #6: Kubernetes — DNS outage

https://app.gremlin.com/scenarios/recommended/kubernetes-availability-dns-outage/hosts
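One possible application-level mitigation to test during a DNS outage exercise is serving the last known address when resolution fails. The sketch below is an assumption on my part rather than a recommended design: `StaleOkResolver` is a hypothetical class, and the resolver function is injected so the example runs without a network. In real code you might pass `socket.gethostbyname`, whose errors are subclasses of `OSError`.

```python
import time


class StaleOkResolver:
    """Cache lookups and fall back to the last known answer on failure."""

    def __init__(self, resolve, ttl_seconds=30.0, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # host -> (address, fetched_at)

    def lookup(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached and now - cached[1] < self._ttl:
            return cached[0]  # still fresh
        try:
            address = self._resolve(host)
        except OSError:
            if cached:
                return cached[0]  # stale answer instead of an outage
            raise
        self._cache[host] = (address, now)
        return address


# Simulated resolver and clock so the example needs no network access.
outage = False
now = [0.0]


def fake_resolve(host):
    if outage:
        raise OSError("simulated DNS outage")
    return {"db.internal": "10.0.0.5"}[host]


resolver = StaleOkResolver(fake_resolve, ttl_seconds=30.0, clock=lambda: now[0])
first = resolver.lookup("db.internal")
outage = True
now[0] = 60.0  # the cached entry is now older than the TTL
during_outage = resolver.lookup("db.internal")
print(first, during_outage)
```

Whether stale answers are acceptable depends on the service, so treat this as something to validate in the exercise, not as a default.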

Harden K8s clusters: Correctness (Shutdown, Latency, and Packet Loss)

  • If a node is shut down, do I retain data integrity and correctness?
  • How does my Kubernetes cluster handle latency?
  • How does my Kubernetes cluster handle packet loss?
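When you inject latency or packet loss, client-side retries are often what decides whether users notice. Here is a minimal sketch of retrying with exponential backoff. The `call_with_retries` helper, the attempt count, and the delays are assumptions for illustration, and the flaky operation only simulates dropped requests. For correctness, retries like this are only safe for idempotent operations, since a retried payment must not be applied twice.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            sleep(base_delay * 2 ** attempt)


# Simulated operation: the first two calls are "lost", the third succeeds.
calls = {"count": 0}


def flaky_balance_lookup():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("simulated packet loss")
    return "ok"


delays = []  # record backoff delays instead of actually sleeping
result = call_with_retries(flaky_balance_lookup, sleep=delays.append)
print(result, delays)
```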

Hardening Exercise #7: Kubernetes — Shutdown a node

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-shutdown-a-node/hosts

Hardening Exercise #8: Kubernetes — Shutdown a service

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-shutdown-a-service/containers

Hardening Exercise #9: Kubernetes — Inject latency to a node

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-inject-latency-to-a-node/hosts

Hardening Exercise #10: Kubernetes — Inject latency to a service

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-inject-latency-to-a-service/containers

Hardening Exercise #11: Kubernetes — Inject packet loss to a node

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-inject-packet-loss-to-a-service/containers

Hardening Exercise #12: Kubernetes — Inject packet loss to a service

https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-inject-packet-loss-to-a-service/containers

Conclusion

Principal Site Reliability Engineer @GremlinInc http://gremlin.com | Chaos Engineering ☁️ 💻 ⚡️💀 Previously @DigitalOcean @Dropbox @NAB @QUT
