Interested in becoming a Site Reliability Engineer?

Tammy Butow
Dec 10, 2016

Hi there,

I’m excited that you are interested in learning more about being an SRE. I am in an Engineering Leadership role at Dropbox where I work with SREs on our Databases and Magic Pocket Infrastructure Teams. We have over 500 petabytes of data, over 500 million customers and very small teams.

This list of resources was collected by Krishelle and me. Krishelle recently joined us at Dropbox after graduating from Hackbright Academy. Read more about Krishelle’s career change and her path to becoming an SRE on HuffPost.

SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.

Here are some handy links and resources to get you started learning more and picking up the skills that will help you build a career as an SRE.

Don’t go it alone, join the community

There are many communities you can join on your journey to becoming an SRE. I recommend joining the following Slack community, which I’m also a member of:

  1. Chaos Engineering Slack Community: tinyurl.com/chaoseng

What is SRE?

https://www.youtube.com/embed/A_6PhdFX1rw

SRE @ Dropbox

Krishelle & Tammy, Dropbox Infrastructure Engineering
  1. Site Reliability Engineering @ Dropbox https://youtu.be/ggizCjUCCqE
  2. How we have scaled Dropbox https://youtu.be/PE4gwstWhmc
  3. Go @ Dropbox https://youtu.be/JOx9enktnUM
  4. Database Monitoring @ Dropbox https://vimeo.com/173607649
  5. Dropbox Databases Infrastructure https://youtu.be/71VryWiEw2A
  6. Adventures in MySQL @ Dropbox https://youtu.be/xFoA5wWpl0s
  7. Bridging the Safety Gap from Scripts to Full Auto-Remediation https://www.usenix.org/conference/srecon16europe/program/presentation/mah

Git Fundamentals Tutorials

Learn to use git by heart
  1. Introduction to Git: Installation, Usage, and Branches
    https://www.digitalocean.com/community/tutorial_series/introduction-to-git-installation-usage-and-branches

Code Review Best Practice Tips

  1. Read the best code you can find in your company, and check out our Dropbox open source projects, e.g. Marshal. You will learn a ton from reading other people’s code.
  2. Small and frequent over large and rare. Many open source projects will reject your change if it’s over 100 lines: that’s a lot of code to read and review, spot issues in, and suggest changes to. Keep your changes small and get feedback often. Some of the engineers on my team regularly send 1–3 line diffs. http://blogs.atlassian.com/2010/03/code_review_in_agile_teams_part_ii/
  3. Why code reviews matter (and actually save time!) https://www.atlassian.com/agile/code-reviews
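If you want a quick way to check whether a change is “small” before sending it for review, you can count the added and removed lines in the output of `git diff`. Here is a minimal Python sketch, assuming a standard unified diff as input. The function names are hypothetical, and the 100-line limit is just the rule of thumb mentioned above, not a universal policy.

```python
def count_changed_lines(diff_text: str) -> int:
    """Count added and removed lines in a unified diff (e.g. `git diff` output)."""
    changed = 0
    for line in diff_text.splitlines():
        # '+++ ' and '--- ' lines name the files being compared; they are not changes.
        # Simplification: a removed line whose own content starts with "-- " is skipped too.
        if line.startswith(("+++ ", "--- ")):
            continue
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def is_small_change(diff_text: str, limit: int = 100) -> bool:
    """True if the diff stays within `limit` changed lines."""
    return count_changed_lines(diff_text) <= limit


sample_diff = """diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-print("hi")
+print("hello")
 x = 1
"""
print(count_changed_lines(sample_diff), is_small_change(sample_diff))
```

You could wrap this in a tiny script that reads `git diff` from standard input and warns you before you open a large review.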

Vim Fundamentals

Learn Vim through play
  1. How to use vim for advanced editing of plain text or code on a Virtual Private Server https://www.digitalocean.com/community/tutorials/how-to-use-vim-for-advanced-editing-of-plain-text-or-code-on-a-vps--2
  2. Vim Adventures game http://vim-adventures.com/

Linux Fundamentals Tutorials

  1. Initial Server Setup with Ubuntu 16.04
    https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-16-04
  2. Getting Started with Linux
    https://www.digitalocean.com/community/tutorial_series/getting-started-with-linux
  3. How to write a simple shell script on a VPS
    https://www.digitalocean.com/community/tutorials/how-to-write-a-simple-shell-script-on-a-vps
  4. SSH essentials
    https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys

Python Tutorials

19 December, 2016
  1. How to install and set up a local programming environment for Python 3 https://www.digitalocean.com/community/tutorial_series/how-to-install-and-set-up-a-local-programming-environment-for-python-3
  2. How to crawl a web page with Scrapy and Python 3 https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
  3. How To Create a Twitter app with Python https://www.digitalocean.com/community/tutorials/how-to-create-a-twitter-app
  4. How to make a simple calculator program in Python https://www.digitalocean.com/community/tutorials/how-to-make-a-simple-calculator-program-in-python-3

Go Tutorials

  1. How to install Go 1.6 on Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/how-to-install-go-1-6-on-ubuntu-16-04
  2. How To Use Martini to Serve Go Applications Behind an Nginx Server on Ubuntu https://www.digitalocean.com/community/tutorials/how-to-use-martini-to-serve-go-applications-behind-an-nginx-server-on-ubuntu
  3. How to install Go and Revel on an Ubuntu Virtual Private Server https://www.digitalocean.com/community/tutorials/how-to-install-go-and-revel-on-an-ubuntu-13-04-x64-vps

Testing

  1. Write simple tests that you will want to maintain in the future, and that others will want to maintain too.
  2. Use Travis CI for your GitHub projects.
    Sync your GitHub projects with Travis CI and you’ll be testing your code in minutes: https://travis-ci.org/. You’ll need to know Git version control and how to use GitHub before you start with Travis CI.
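Here is what a small, maintainable test can look like, using Python’s built-in `unittest` module. `parse_percent` is a hypothetical helper made up for this example; the point is the shape of the tests: one behavior per test, each with a name that says what it checks.

```python
import unittest


def parse_percent(value: str) -> float:
    """Hypothetical helper: turn a string like '42%' into a number from 0 to 100."""
    number = float(value.strip().rstrip("%"))
    if not 0 <= number <= 100:
        raise ValueError(f"percentage out of range: {value!r}")
    return number


class ParsePercentTest(unittest.TestCase):
    # One behavior per test, named after what it checks, so failures explain themselves.
    def test_parses_plain_percentage(self):
        self.assertEqual(parse_percent("42%"), 42.0)

    def test_ignores_surrounding_whitespace(self):
        self.assertEqual(parse_percent("  7.5% "), 7.5)

    def test_rejects_out_of_range_values(self):
        with self.assertRaises(ValueError):
            parse_percent("120%")
```

Save it in a file whose name matches `test*.py` (for example `test_percent.py`) and run `python -m unittest` from the same directory; test discovery will pick it up. A CI service such as Travis CI can then run the same command on every push; see the Travis CI documentation for how to configure the build.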

Databases Tutorials (MySQL & Percona)

We use Percona at Dropbox

MySQL is used by Dropbox, Facebook, Slack, Google and many more. Most of these companies, Dropbox included, don’t use MySQL exclusively, but it is widely used and therefore useful to understand.

  1. A basic MySQL Tutorial https://www.digitalocean.com/community/tutorials/a-basic-mysql-tutorial
  2. How to set up replication (primary/replica) https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-in-mysql
  3. How to set up replication (primary/primary) https://www.digitalocean.com/community/tutorials/how-to-set-up-mysql-master-master-replication
  4. How To Install a Fresh Percona Server or Replace MySQL https://www.digitalocean.com/community/tutorials/how-to-install-a-fresh-percona-server-or-replace-mysql
  5. How To Create Hot Backups of MySQL Databases with Percona XtraBackup on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-create-hot-backups-of-mysql-databases-with-percona-xtrabackup-on-ubuntu-14-04
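Once replication is running, you need to know when it breaks. Here is a minimal Python sketch, assuming you have already fetched one row of `SHOW SLAVE STATUS` as a dict (for example with a dictionary cursor from your MySQL driver). It checks that both replication threads are running and that `Seconds_Behind_Master` is under a threshold. The function name and the 30-second threshold are illustrative assumptions.

```python
from typing import Any, Mapping, Tuple

# Assumed threshold: how many seconds a replica may lag before we call it unhealthy.
MAX_LAG_SECONDS = 30


def replica_health(status: Mapping[str, Any], max_lag: int = MAX_LAG_SECONDS) -> Tuple[bool, str]:
    """Judge one row of SHOW SLAVE STATUS output, given as a column-name -> value dict.

    MySQL 8.0.22+ also offers SHOW REPLICA STATUS, whose columns say Replica/Source
    instead of Slave/Master; adjust the keys if you use it.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return False, "I/O thread not running: the replica is not receiving the primary's binlog"
    if status.get("Slave_SQL_Running") != "Yes":
        return False, "SQL thread not running: received events are not being applied"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when it cannot work out the lag.
        return False, "replication lag is unknown"
    if int(lag) > max_lag:
        return False, f"replica is {lag}s behind the primary"
    return True, "ok"


row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 120}
print(replica_health(row))
```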

Want to try a NoSQL Database?

The Netflix member experience is offered to more than 83 million members worldwide and delivered using thousands of microservices. Netflix uses Cassandra; you can read more about it here:

Now it’s time to set up your own Cassandra cluster using cloud infrastructure!

  1. How to install Cassandra and run a single node cluster on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-ubuntu-14-04
  2. How to run a multi-node cluster database with Cassandra on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-run-a-multi-node-cluster-database-with-cassandra-on-ubuntu-14-04
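When you move from one node to a multi-node cluster, the replication factor (RF) and the consistency level decide how many node failures you can survive. In Cassandra, a QUORUM request within a single datacenter needs floor(RF / 2) + 1 replicas to respond. Below is a small Python sketch of that arithmetic, plus the general R + W > RF overlap rule from Dynamo-style systems; the function names are illustrative, and multi-datacenter setups follow different rules.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond to a QUORUM read or write (single datacenter)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


def reads_overlap_writes(replication_factor: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set shares at least one replica with every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


for rf in (1, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, survives {tolerated_failures(rf)} replica(s) down, "
          f"QUORUM reads see QUORUM writes: {reads_overlap_writes(rf, q, q)}")
```

This is why RF=3 is such a common choice: QUORUM reads and writes keep working with one replica down.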

Production Web Application Tutorials

It’s time to deploy to production and maintain uptime!
  1. Building for Production: Web Applications
    This 6-part tutorial series shows you how to build a multi-server production application setup from scratch. The final setup is supported by backups, monitoring, and centralized logging systems, which help you detect problems and recover from them. The goal of the series is to build on standalone system administration concepts and introduce some of the practical considerations of creating a production server setup. https://www.digitalocean.com/community/tutorial_series/building-for-production-web-applications
  2. How To Set Up an Apache, MySQL, and Python (LAMP) Server Without Frameworks on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-set-up-an-apache-mysql-and-python-lamp-server-without-frameworks-on-ubuntu-14-04
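“Maintaining uptime” is easier to reason about with a number attached. SREs often express reliability as an availability target and treat the downtime that target still allows as an error budget, an idea covered in the Site Reliability Engineering book listed below. Here is a small Python sketch of the arithmetic; the targets and the 30-day window are illustrative choices.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime that still meets `availability_target` (a fraction, e.g. 0.999) over `window`."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability target must be in (0, 1]")
    # The error budget is the fraction of the window you are allowed to be down.
    return window * (1 - availability_target)


thirty_days = timedelta(days=30)
for target in (0.99, 0.999, 0.9999):
    budget = allowed_downtime(target, thirty_days)
    print(f"{target:.2%} over 30 days allows {budget.total_seconds() / 60:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves about 43.2 minutes of downtime, which goes quickly if every deploy causes a short outage.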

Networking Tutorials

  1. An Introduction to Networking Terminology, Interfaces, and Protocols https://www.digitalocean.com/community/tutorials/an-introduction-to-networking-terminology-interfaces-and-protocols

Distributed Systems Resources

  1. Introduction to Distributed System Design http://www.hpcs.cs.tsukuba.ac.jp/~tatebe/lecture/h23/dsys/dsd-tutorial.html
  2. Blog on building bigger, faster, more reliable websites http://highscalability.com/

Monitoring and Logging

  1. Building for Production: Web Applications — Centralized Logging https://www.digitalocean.com/community/tutorials/building-for-production-web-applications-centralized-logging
  2. How To Install Elasticsearch, Logstash, and Kibana (ELK Stack) on Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-16-04
  3. How To Gather Infrastructure Metrics with Packetbeat and ELK on Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/how-to-gather-infrastructure-metrics-with-packetbeat-and-elk-on-ubuntu-16-04
  4. How To Install Nagios 4 and Monitor Your Servers on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-install-nagios-4-and-monitor-your-servers-on-ubuntu-14-04
  5. How To Use Icinga To Monitor Your Servers and Services On Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-use-icinga-to-monitor-your-servers-and-services-on-ubuntu-14-04
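Nagios and Icinga both run external check plugins and read their exit code: 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN, along with a one-line status message. Here is a minimal disk-usage check in Python that follows that convention. The thresholds and function names are assumptions for illustration, not part of either tool.

```python
import shutil
from typing import List

# Exit codes from the Nagios plugin convention (Icinga follows the same one).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed thresholds, as a percentage of the filesystem in use.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def classify(used_percent: float, warn: float = WARN_PERCENT, crit: float = CRIT_PERCENT) -> int:
    """Map a disk-usage percentage onto a plugin exit code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> int:
    """Print a one-line status for `path` and return the matching exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot read {path}: {exc}")
        return UNKNOWN
    if usage.total == 0:
        print(f"DISK UNKNOWN - {path} reports zero capacity")
        return UNKNOWN
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent)
    print(f"DISK {LABELS[code]} - {path} is {used_percent:.1f}% full")
    return code


def main(argv: List[str]) -> int:
    """Entry point: check the path given on the command line, or '/' by default."""
    return check_disk(argv[1] if len(argv) > 1 else "/")
```

In a real plugin script you would end with `sys.exit(main(sys.argv))` so that Nagios or Icinga sees the exit code.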

Estimation

  1. “How long will it take you to build this?” This is a very hard question when you are starting out, and it stays hard even with years of experience. Figure out how long things take you by measuring the time you actually spend. You will learn your engineering pace, and as you improve you will see that pace change. https://www.atlassian.com/agile/estimation
  2. Improve your project estimation skills. Good estimation helps us accurately calculate the “time invested” in each project. Read The Effective Engineer to learn more.
  3. Allow buffer room for the unknown in the schedule. Take into account competing work obligations, holidays, illnesses, etc. The longer a project, the higher the probability that some of these will occur.
  4. Define measurable milestones. Clear milestones tell us whether we’re on track or falling behind. Use them as opportunities to revise our estimates.
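One simple way to put “measure your time spent” into practice is to record your estimate and the actual time for each finished task, then scale new estimates by your historical ratio and add a buffer for the unknown. Here is a Python sketch of that heuristic. The median-based multiplier and the 20% default buffer are assumptions made for this sketch, not a method from The Effective Engineer.

```python
from statistics import median
from typing import Sequence, Tuple

# Each entry is (estimated_hours, actual_hours) for a task you have finished.
History = Sequence[Tuple[float, float]]


def estimate_multiplier(history: History) -> float:
    """Median ratio of actual to estimated time; 1.0 if there is no usable history."""
    ratios = [actual / estimated for estimated, actual in history if estimated > 0]
    if not ratios:
        return 1.0
    # The median keeps one disastrous task from skewing every future estimate.
    return median(ratios)


def calibrated_estimate(raw_hours: float, history: History, buffer: float = 0.2) -> float:
    """Scale a gut-feel estimate by your track record, then add a buffer (assumed 20%)."""
    return raw_hours * estimate_multiplier(history) * (1 + buffer)


past_tasks = [(2, 3), (4, 6), (1, 1.5), (8, 10)]
print(f"Gut feel: 5h, calibrated: {calibrated_estimate(5, past_tasks):.1f}h")
```

As your estimates improve, the multiplier drifts toward 1.0, which is a nice way to watch your engineering pace change over time.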

Prepping for Tech Interviews

  1. Deploy your project on multiple services (Amazon AWS, DigitalOcean, etc.)
  2. Get the GitHub Student Developer Pack if you are eligible; it has tons of savings: https://education.github.com/pack
  3. Rewrite your project in a new framework or language, or switch it to a different database
  4. Check out http://girlgeekacademy.com/ for upcoming events
  5. Do a Databases Interview Preparation Course: https://www.go1.com/#!/course/databases-interview-preparation/980537
  6. Go to a hackathon like She Hacks
  7. Work through the coding questions in “Cracking the Coding Interview” on a whiteboard with friends, deep dive into your own projects, and practice troubleshooting examples.

Recommended Books & Papers

  1. Site Reliability Engineering http://shop.oreilly.com/product/0636920041528.do
  2. Cracking the Tech Career https://www.amazon.com/Cracking-Tech-Career-Insider-Microsoft/dp/1118968085
  3. Cracking the Coding Interview https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/098478280X
  4. The Effective Engineer http://www.theeffectiveengineer.com/
  5. The Go Programming Language https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440
  6. Automate the Boring Stuff with Python https://automatetheboringstuff.com/
  7. The TCP/IP Guide http://shop.oreilly.com/product/9781593270476.do
  8. Understanding Linux Network Internals http://shop.oreilly.com/product/9780596002558.do
  9. Security for Web Developers http://shop.oreilly.com/product/0636920041429.do
  10. How Complex Systems Fail http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Meetups

  1. Ladies who Linux https://www.meetup.com/Ladies-Who-Linux-Bay-Area/
  2. Women Who Go https://www.meetup.com/Women-Who-Go/
  3. PyLadies https://www.meetup.com/PyLadiesSF/

The first SRE

Margaret Hamilton, who worked on the Apollo program on loan from MIT, had all the significant traits of the first SRE. In her own words, “part of the culture was to learn from everyone and everything, including from that which one would least expect” (Site Reliability Engineering: How Google Runs Production Systems).

Best of luck on this exciting journey!

Good Luck,
Tammy & Krishelle
