Site Reliability Engineering for Kubernetes

Thanks to https://twitter.com/MindsEyeCCF for this illustration

Reliability Deep Dive

  1. Look back and analyze failures
  2. Determine goals and key metrics
  3. Create a reliability strategy
  4. Put your reliability strategy to the test

Look Back & Analyze Failures

Common Failure Modes for Kubernetes in Production

Kubernetes Failure Mode: High CPU

Kubernetes Failure Mode: CPU Throttling

Kubernetes Failure Mode: Autoscaling via CPU

Analysis Of Reported Kubernetes Outages by Cloud Provider

Common Failure Modes to Prepare For by Cloud Provider

Kubernetes on AWS: Failure Modes

Kubernetes on GKE: Failure Modes

Kubernetes on Azure: Failure Modes

Kubernetes On Prem: Failure Modes

Determine Goals and Key Metrics

  • Number of Kubernetes clusters
  • Number of nodes
  • Number of nodes by cluster
  • Number of pods
  • Number of pods by node
  • Nodes by uptime (min to max)
  • Pods by uptime (min to max)
  • Number of applications/services
  • Number of applications/services by cluster
  • Resource utilization by node (e.g. CPU)
  • Disk writes per node
  • Disk reads per node
  • Network errors per node
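
As a rough illustration, several of the counts above can be derived from a flat inventory of pods. This is a minimal sketch: the `PodRecord` shape, its field names, and the sample data are all hypothetical, and in practice the records would come from your cluster API or monitoring system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRecord:
    # Hypothetical record shape; the field names are assumptions for this sketch.
    cluster: str
    node: str
    pod: str
    service: str


def summarize(pods):
    """Derive a few of the inventory metrics listed above from pod records."""
    nodes = {(p.cluster, p.node) for p in pods}
    services = {(p.cluster, p.service) for p in pods}
    return {
        "clusters": len({p.cluster for p in pods}),
        "nodes": len(nodes),
        "nodes_by_cluster": dict(Counter(c for c, _ in nodes)),
        "pods": len(pods),
        "pods_by_node": dict(Counter((p.cluster, p.node) for p in pods)),
        "services_by_cluster": dict(Counter(c for c, _ in services)),
    }


# Made-up sample inventory, for illustration only.
SAMPLE = [
    PodRecord("prod-a", "node-1", "web-1", "web"),
    PodRecord("prod-a", "node-1", "api-1", "api"),
    PodRecord("prod-a", "node-2", "web-2", "web"),
    PodRecord("prod-b", "node-1", "web-3", "web"),
]

summary = summarize(SAMPLE)
```

Because this sketch only sees pods, a node with no pods scheduled on it would not be counted; a real inventory would also read the node list directly.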

Create a Reliability Plan

  1. Look back and analyze failures
  2. Determine goals and key metrics
  3. Create a reliability strategy
  4. Put your reliability strategy to the test

Fictional Reliability Plan: Internet Banking as a Service

  1. Scale: to ensure we can reliably scale to meet our customers' needs, we will focus on scalability. As we onboard new customers, both they and our existing customers should have a smooth experience.
  2. Availability: to ensure customers can always access their internet banking, we will treat uptime as a core service offering and fix problems quickly when they occur. Customers should always have access to their money.
  3. Correctness: to ensure internet banking transactions are accurate and shown to customers in real time, we will focus on correctness.

Put Your Kubernetes Reliability Strategy To The Test

  1. 💗 Scale — CPU
  2. ✅ Availability — Blackhole, DNS
  3. 👌 Correctness — Shutdown, Latency and Packet Loss

Harden K8s clusters: Scale (CPU)

  • How much CPU will I need per instance?
  • Will I use Kubernetes pod priorities to manage resources?
  • How difficult is it for me to upgrade my instances and increase CPU?
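
One way to start on the first question is to compare a node's allocatable CPU with the sum of the CPU requests of the pods placed on it. The sketch below assumes CPU quantities are written either as plain decimals ("2", "1.5") or with the millicore suffix ("500m"); the function names are hypothetical and other quantity formats are not handled.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "1.5" to millicores.

    Only plain decimal and "m"-suffixed values are handled in this sketch.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_headroom(allocatable: str, pod_requests: list) -> int:
    """Return unrequested CPU on a node in millicores (negative if overcommitted)."""
    requested = sum(cpu_to_millicores(r) for r in pod_requests)
    return cpu_to_millicores(allocatable) - requested


# Example: a 4-core node running pods that request 500m, 1.5 and 250m.
headroom = cpu_headroom("4", ["500m", "1.5", "250m"])
```

Requests describe what is reserved, not what is actually consumed, so this is a starting point for capacity planning rather than a measure of real load.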

Hardening Exercise #1: Kubernetes — High CPU

Hardening Exercise #2: Kubernetes — Throttle CPU

Hardening Exercise #3: Kubernetes — Autoscaling via CPU
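
Before running an autoscaling exercise, it helps to know roughly what replica count to expect. The sketch below follows the core formula documented for the Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with the default 10% tolerance. It is a simplification: it ignores min/max replica bounds, stabilization windows, and pods that are not ready, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct, tolerance=0.1):
    """Estimate the replica count an HPA would aim for from average CPU utilization."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Example: 4 replicas averaging 90% CPU against a 60% target.
expected = desired_replicas(4, 90, 60)
```

Comparing this estimate with what the cluster actually does during the exercise is a quick way to spot misconfigured targets or missing CPU requests.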

Harden K8s clusters: Availability (Blackhole and DNS)

  • Can my Kubernetes cluster gracefully handle a node becoming unavailable?
  • Can my Kubernetes cluster gracefully handle a pod becoming unavailable?
  • How does my Kubernetes cluster handle a DNS outage?
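
For the node question, a useful pre-check is whether any service has all of its replicas on a single node, since blackholing that node would take the whole service down. This is a minimal sketch over hypothetical (service, node) pairs; the function name and sample data are assumptions, and real placements would come from your cluster.

```python
from collections import defaultdict


def single_node_services(placements):
    """Return the services whose running pods all sit on one node.

    `placements` is an iterable of (service, node) pairs, one per running pod.
    """
    nodes_by_service = defaultdict(set)
    for service, node in placements:
        nodes_by_service[service].add(node)
    return sorted(s for s, nodes in nodes_by_service.items() if len(nodes) == 1)


# Made-up placements, for illustration only.
PLACEMENTS = [
    ("web", "node-1"),
    ("web", "node-2"),
    ("api", "node-1"),
    ("api", "node-1"),
    ("db", "node-3"),
]

at_risk = single_node_services(PLACEMENTS)
```

Here `api` has two replicas but both are on `node-1`, so it is flagged alongside the single-replica `db`.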

Hardening Exercise #4: Kubernetes — Blackhole a Kubernetes node

Hardening Exercise #5: Kubernetes — Blackhole a region

Hardening Exercise #6: Kubernetes — DNS outage

Harden K8s clusters: Correctness (Shutdown, Latency, and Packet Loss)

  • If a node is shut down, do I retain data integrity and correctness?
  • How does my Kubernetes cluster handle latency?
  • How does my Kubernetes cluster handle packet loss?
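
When planning latency and packet loss experiments, a back-of-the-envelope model can set expectations for what clients should see. The sketch below assumes each attempt fails independently with the same probability and that a client retries a fixed number of times; both are simplifying assumptions, and the function names are hypothetical.

```python
def failure_probability(loss_rate, attempts):
    """Probability that every attempt fails, assuming independent attempts."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return loss_rate ** attempts


def worst_case_latency_ms(timeout_ms, attempts, backoff_ms=0):
    """Upper bound on time spent if every attempt runs to its timeout."""
    return attempts * timeout_ms + (attempts - 1) * backoff_ms


# Example: 10% per-attempt failure rate, 3 attempts, 200 ms timeout, 50 ms backoff.
p_fail = failure_probability(0.1, 3)
budget_ms = worst_case_latency_ms(200, 3, 50)
```

Retries only preserve correctness when the operation is safe to repeat, which is worth verifying explicitly in the shutdown and packet loss exercises.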

Hardening Exercise #7: Kubernetes — Shutdown a node

Hardening Exercise #8: Kubernetes — Shutdown a service

Hardening Exercise #9: Kubernetes — Inject latency to a node

Hardening Exercise #10: Kubernetes — Inject latency to a service

Hardening Exercise #11: Kubernetes — Inject packet loss to a node

Hardening Exercise #12: Kubernetes — Inject packet loss to a service

Conclusion

Tammy Bryant Butow
