Site Reliability Engineering for Kubernetes
Over the last 4.5 years, Kubernetes has dramatically improved in usability, and it’s now easier than ever to get started. Cloud providers such as Amazon Web Services (AWS) now offer managed Kubernetes products that create and manage clusters for you, which is a huge change compared to rolling your own Kubernetes cluster.
One of the most interesting shifts I have seen in our industry over the last 2 years is that more and more companies are now running their Production workloads on Kubernetes. This is where things start to get interesting for SREs: we can learn from each other, discuss common reliability issues, and share reliability principles that harden Kubernetes clusters.
Reliability Deep Dive
As an SRE, I have a framework I use when doing a Reliability Deep Dive 🌊:
- Look back and analyze failures
- Determine goals and key metrics
- Create a reliability strategy
- Put your reliability strategy to the test
Once this framework is in action, I continuously monitor and report on progress.
Look Back & Analyze Failures
Let’s get started by looking back at common failure modes when running Kubernetes in Production.
Common Failure Modes for Kubernetes in Production
Based on the postmortems collected and shared at k8s.af, we can identify the most common failure modes currently impacting Kubernetes in Production. Incidents were most commonly caused by CPU-related issues (25%) or by clusters becoming unavailable due to a range of issues (25%). The remaining 50% of incidents were related to Networking (DNS, Latency, and Packet Loss), Resources (Disk or Memory), or Security.
CPU-related outages can be bucketed into three categories: High CPU, CPU Throttling, and Autoscaling via CPU.
Kubernetes Failure Mode: High CPU
Several companies have reported High CPU spikes causing problems for their company and users. Target engineer Dan Woods shared how their Kubernetes clusters were impacted by a high CPU incident in a Medium post titled On Infrastructure at Scale: A Cascading Failure of Distributed Systems.
Key excerpts from the post, illustrated by Emily Griffin, are included in the original article.
Kubernetes Failure Mode: CPU Throttling
In July 2019, Henning Jacobs (Zalando) gave a talk at ContainerDays in Hamburg titled “Kubernetes Failure Stories, or: How to Crash Your Cluster — Henning Jacobs”. In this talk, Henning explained how CPU throttling impacted the reliability of the cluster.
Kubernetes Failure Mode: Autoscaling via CPU
In 2017, Nordstrom shared “101 Ways to Crash Your Cluster” based on their experience running Kubernetes at scale. This included examples related to autoscaling.
Next, let’s see whether failure modes differ by cloud provider. We would expect certain failure modes to be more commonly associated with specific cloud providers.
Analysis Of Reported Kubernetes Outages by Cloud Provider
There have been 45 reported Kubernetes Production incidents collected and shared at k8s.af. When we analyze the cloud providers most commonly mentioned in these postmortems, we can see that 65.8% of the incidents occurred on AWS. These were primarily hand-rolled Kubernetes clusters running on EC2 rather than AWS’s managed Kubernetes service (EKS); only one EKS incident has been reported. GKE users experienced 23.7% of outages, followed by Azure (5.3%). The remaining 5.3% of outages occurred on on-prem Kubernetes clusters.
Next, let’s explore the common failure modes by cloud provider. Since autoscaling, CPU, and instance shutdown are managed differently across cloud providers, I expect certain failures to be more commonly experienced on specific providers (e.g. CPU on AWS).
Common Failure Modes to Prepare For by Cloud Provider
Kubernetes on AWS: Failure Modes
If you are using AWS, I recommend you focus on CPU as your primary failure mode during your hardening activities. I then recommend you investigate networking-related failures (primarily Blackhole, Latency, and DNS).
Kubernetes on GKE: Failure Modes
If you are using GKE, I recommend you focus on Blackhole as your primary failure mode during your hardening activities. I then recommend you investigate Shutdown, Latency and DNS.
Kubernetes on Azure: Failure Modes
If you are using Azure, I recommend you focus on Shutdown as your primary failure mode during your hardening activities.
Kubernetes On Prem: Failure Modes
If you run your own on-prem hardware and datacenters, I recommend you focus on CPU and DNS.
Site Reliability Engineering for Kubernetes
Now that we’ve explored common failure modes for Kubernetes across cloud providers, it is time to take what we have learned and use it to practice SRE for Kubernetes.
SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.
We are now ready to set our goals and identify key metrics.
Determine Goals and Key Metrics
There are a number of important metrics to keep an eye on when focusing on the reliability of your Kubernetes cluster. Here is a selection of Kubernetes metrics to report on and track in real time (a sketch of how a few of these could be collected follows the list):
- Number of Kubernetes clusters
- Number of nodes
- Number of nodes by cluster
- Number of pods
- Number of pods by node
- Nodes by uptime (min to max)
- Pods by uptime (min to max)
- Number of applications/services
- Number of applications/services by cluster
- Resource utilization by node (e.g. CPU)
- Disk writes per node
- Disk reads per node
- Network errors per node
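To make these metrics concrete, here is a minimal sketch of how a few of them could be tracked as Prometheus recording rules. This assumes the Prometheus Operator and kube-state-metrics are installed and scraped; the rule names and namespace are hypothetical.

```yaml
# Hypothetical recording rules; assumes kube-state-metrics is scraped by Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-inventory-rules
  namespace: monitoring
spec:
  groups:
    - name: cluster-inventory
      rules:
        # Total number of nodes in the cluster
        - record: cluster:node_count
          expr: count(kube_node_info)
        # Total number of pods in the cluster
        - record: cluster:pod_count
          expr: count(kube_pod_info)
        # Number of pods per node
        - record: node:pod_count
          expr: count by (node) (kube_pod_info)
```

Counts of clusters, applications, and services typically come from your fleet inventory or workload labels rather than from a single cluster’s metrics.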
Create a Reliability Plan
Now we are ready to create our reliability strategy for Kubernetes at our company. I highly recommend creating a custom strategy based on your company and the goals you are working towards as an organization. As a reminder, here is the framework we are following:
- Look back and analyze failures
- Determine goals and key metrics
- Create a reliability strategy
- Put your reliability strategy to the test
When we create our reliability plan, we need to start with our Reliability Values. These focus on who we are: Who is our company? Who do we serve? What are our beliefs and the beliefs of our customers? What do our customers value from us?
We then determine our Reliability Vision, where we ask ourselves “why are we in this business?” and “why do our customers need us?”. Next comes our Reliability Mission, where we ask “what do we do?” and “how can we change our world, industry, and community?”. Then we set our Strategic Goals, asking ourselves “what do we want to accomplish and by when?”. Finally, we determine the tactics we will use to achieve our strategic goals by asking “how will we get there?”. We decide on our short-term goals (less than 1 year) and identify the projects, resources, and people needed to make it happen.
Fictional Reliability Plan: Internet Banking as a Service
Let’s create a fictional reliability plan for a bank that provides internet banking as a service:
Values🌟: Passion for customers — Putting ourselves in our customers’ shoes, finding and delivering the right solutions for them, and fixing things quickly if we get them wrong.
Vision👁️: We aim to win together by being bold and making good decisions for our customers, people, and communities.
Mission📜 : Be the world’s leading online bank, trusted by customers and loved for exceptional service.
Strategic Goals📅: In the next 5 years we want to become the most popular online bank by both total number of customers and customer satisfaction rating.
Tactics📝:
- Scale — to ensure we can reliably scale and meet our customers’ needs, we will focus on scalability. As we onboard new customers, both new and existing customers should have a smooth experience.
- Availability — to ensure customers are always able to access their internet banking we will focus on uptime as a core service offering and will fix things quickly if we get them wrong. We will ensure customers always have access to their money.
- Correctness — to ensure internet banking transactions are accurate and reflected to customers in real time, we will focus on data correctness.
I personally don’t believe you can do reliability work in a vacuum, away from your customers or your company’s values, vision, and mission.
Put Your Kubernetes Reliability Strategy To The Test
Now that we’ve looked back at failures, determined goals, and created a reliability plan, we are ready to put that plan to the test!
Based on what we have discovered, we need to be prepared for certain failure modes that could prevent our Reliability Plan from succeeding.
Based on our fictional example, we can align common failure modes with our reliability plan tactics as follows:
- 💗 Scale — CPU
- ✅ Availability — Blackhole, DNS
- 👌 Correctness — Shutdown, Latency and Packet Loss
Next, we can harden our K8s clusters in a prioritized way based on what we have learned and on our company’s tactics. In the next section, I will outline how to perform your hardening exercises.
Harden K8s clusters: Scale (CPU)
CPU management is important for Production workloads, and it can easily cause issues if not managed appropriately. There are a few important questions to ask yourself (a configuration sketch follows this list):
- How much CPU will I need per instance?
- Will I use Kubernetes pod priorities to manage resources?
- How difficult is it for me to upgrade my instances and increase CPU?
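Before running the CPU exercises, it helps to make your CPU expectations explicit in your manifests. Below is a minimal, hypothetical sketch of a pod with CPU requests and limits plus a PriorityClass; all names and values are illustrative, not a recommendation for your workloads.

```yaml
# Hypothetical example; names and values are illustrative.
# A PriorityClass tells the scheduler which pods to keep (and which to evict first)
# when the cluster is under resource pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: customer-facing
value: 1000000
globalDefault: false
description: "Priority for customer-facing services."
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  priorityClassName: customer-facing
  containers:
    - name: app
      image: example.com/payments-api:1.0.0
      resources:
        requests:
          cpu: "500m"      # reserved for scheduling decisions
          memory: "256Mi"
        limits:
          cpu: "1"         # hard cap; hitting it triggers CFS throttling
          memory: "512Mi"
```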
Hardening Exercise #1: Kubernetes — High CPU
We will be using the Gremlin Scenario “Kubernetes — Scale — High CPU” for this hardening exercise. This is a scaling scenario for Kubernetes. It will trigger high CPU. We expect that this should not degrade functionality for the user and all operations should perform as expected.
https://app.gremlin.com/scenarios/recommended/kubernetes-scale-high-cpu/hosts
Hardening Exercise #2: Kubernetes — Throttle CPU
We will be using the Gremlin Scenario “Kubernetes — Scale — Throttle CPU” for this hardening exercise. This is a scaling scenario for Kubernetes. This scenario will increase CPU as a chain of attacks. It will be used to ensure that there are no issues related to CPU throttling, the failure mode Henning Jacobs described in his ContainerDays talk “Kubernetes Failure Stories, or: How to Crash Your Cluster — Henning Jacobs”.
https://app.gremlin.com/scenarios/recommended/kubernetes-scale-throttle-cpu/hosts
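If you want to detect throttling during (and after) this exercise, here is a hedged sketch of a Prometheus alert built on the cAdvisor throttling counters. It assumes the Prometheus Operator is installed and kubelet/cAdvisor metrics are scraped; the threshold and names are illustrative.

```yaml
# Hypothetical alert; assumes cAdvisor metrics are available in Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling-alerts
  namespace: monitoring
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          # Fraction of CFS periods in which a container was throttled over 5 minutes
          expr: |
            sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
              /
            sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is being CPU throttled"
```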
Hardening Exercise #3: Kubernetes — Autoscaling via CPU
We will be using the Gremlin Scenario “Kubernetes — Scale — Autoscaling via CPU” for this hardening exercise. This is a scaling scenario for Kubernetes. It will increase CPU so that AWS autoscaling kicks in. We expect that this should not degrade functionality for the user and that all operations should perform as expected.
AWS Autoscaling Docs: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html
https://app.gremlin.com/scenarios/recommended/kubernetes-scaling-autoscaling-via-cpu/
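For reference, here is a minimal sketch of a HorizontalPodAutoscaler that scales on CPU utilization. The target Deployment name and thresholds are hypothetical, the API version may be autoscaling/v2beta2 on older clusters, and a metrics source such as metrics-server is required.

```yaml
# Hypothetical HPA; the Deployment name and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```

Note that the cluster autoscaler (which adds nodes) and the HPA (which adds pods) interact during this scenario; both are worth observing.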
Harden K8s clusters: Availability (Blackhole and DNS)
Blackhole is a technique you can use to make nodes and pods unavailable more safely; it is a less destructive action than shutdown. There are a few important questions to ask yourself (a configuration sketch follows this list):
- Can my Kubernetes cluster gracefully handle a node becoming unavailable?
- Can my Kubernetes cluster gracefully handle a pod becoming unavailable?
- How does my Kubernetes cluster handle a DNS outage?
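One common way to answer the first two questions is to run multiple replicas and protect them with a PodDisruptionBudget. Below is a minimal, hypothetical sketch; the label and threshold are illustrative, and the API version is policy/v1beta1 on older clusters.

```yaml
# Hypothetical PodDisruptionBudget; the app label and minAvailable are illustrative.
# Keeps at least two replicas running during voluntary disruptions (e.g. node drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-api
```

Spreading those replicas across nodes and zones (for example with topology spread constraints) is what makes a single node or zone blackhole survivable.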
Hardening Exercise #4: Kubernetes — Blackhole a Kubernetes node
We will be using the Gremlin Scenario “Kubernetes — Availability — Blackhole a Kubernetes node” for this hardening exercise. This is an availability scenario for Kubernetes. This scenario will make one node in your Kubernetes cluster unavailable. We expect that the application will still be able to serve user traffic and operate as expected.
Hardening Exercise #5: Kubernetes — Blackhole a region
We will be using the Gremlin Scenario “Kubernetes — Availability — Blackhole a region” for this hardening exercise. This scenario will make one region unavailable. We expect that the application will be able to route user traffic correctly. The application should operate as expected.
https://app.gremlin.com/scenarios/recommended/kubernetes-availability-blackhole-a-region
Hardening Exercise #6: Kubernetes — DNS outage
We will be using the Gremlin Scenario “Kubernetes — Availability — DNS outage” for this hardening exercise. This is an availability scenario for Kubernetes. This scenario will cause a DNS outage. We expect that the application will still be able to serve user traffic and operate as expected due to DNS failover. If DNS failover is not set up correctly, we expect an outage to occur.
https://app.gremlin.com/scenarios/recommended/kubernetes-availability-dns-outage/hosts
Harden K8s clusters: Correctness (Shutdown, Latency, and Packet Loss)
Data integrity and correctness are always core customer concerns. Data issues are the fastest way to lose customer trust and customers. There are a few important questions to ask yourself (a configuration sketch follows this list):
- If a node is shutdown do I retain data integrity and correctness?
- How does my Kubernetes cluster handle latency?
- How does my Kubernetes cluster handle packet loss?
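A good starting point for these questions is to make timeouts and graceful shutdown explicit in the workload spec. Here is a minimal, hypothetical Deployment fragment; the image, port, paths, and timings are illustrative.

```yaml
# Hypothetical Deployment; image, port, paths, and timings are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      terminationGracePeriodSeconds: 60       # time for in-flight requests to drain on shutdown
      containers:
        - name: app
          image: example.com/payments-api:1.0.0
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]   # let endpoints update before the process exits
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            timeoutSeconds: 2                  # treat slow responses as not ready
            periodSeconds: 5
            failureThreshold: 3
```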
Hardening Exercise #7: Kubernetes — Shutdown a node
We will be using the Gremlin Scenario “Kubernetes — Correctness — Shutdown a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will shut down a node. We expect that the application will not lose data and will operate as expected.
https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-shutdown-a-node/hosts
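If the workload is stateful, persistent volumes are what let data survive a node shutdown and follow the pod when it is rescheduled. Below is a minimal, hypothetical StatefulSet sketch; the names, image, and storage size are illustrative.

```yaml
# Hypothetical StatefulSet; names, image, and storage size are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ledger-db
spec:
  serviceName: ledger-db
  replicas: 3
  selector:
    matchLabels:
      app: ledger-db
  template:
    metadata:
      labels:
        app: ledger-db
    spec:
      containers:
        - name: db
          image: example.com/ledger-db:1.0.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/ledger    # data lives on the persistent volume, not the node
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Whether the data is actually consistent after the node comes back is application-specific, which is exactly what this exercise verifies.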
Hardening Exercise #8: Kubernetes — Shutdown a service
We will be using the Gremlin Scenario “Kubernetes — Correctness — Shutdown a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will shutdown one service. We expect that the application will be able to route user traffic correctly and that shutting down one service should not have a knock-on impact to other services. The application should operate as expected.
https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-shutdown-a-service/containers
Hardening Exercise #9: Kubernetes — Inject latency to a node
We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject latency to a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject latency into one node. We expect that the application will still serve traffic, but possibly at a slower rate. The application should operate as expected and throw no errors.
https://app.gremlin.com/scenarios/recommended/kubernetes-correctness-inject-latency-to-a-node/hosts
Hardening Exercise #10: Kubernetes — Inject latency to a service
We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject latency to a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject latency to one service. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.
Hardening Exercise #11: Kubernetes — Inject packet loss to a node
We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject packet loss to a node” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject packet loss to one node. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.
Hardening Exercise #12: Kubernetes — Inject packet loss to a service
We will be using the Gremlin Scenario “Kubernetes — Correctness — Inject packet loss to a service” for this hardening exercise. This is a correctness scenario for Kubernetes. This scenario will inject packet loss to one service. We expect that the application will still serve traffic but possibly at a slower rate. The application should operate as expected and throw no errors.
Conclusion
We’ve walked through how to apply Site Reliability Engineering practices to your Kubernetes clusters. We started by looking back and analyzing failures, then determined goals and key metrics, created a reliability plan, and finally put that plan to the test using Gremlin Scenarios.