A Crash Course in Building Site Reliability

Amin Astaneh, Senior Manager, Infra Services, Acquia

From Wikipedia: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.

Over the past year, Acquia has built its own SRE team to help our products and services scale with demand from our growing number of customers. We want to share our experience so that others can do the same and reap the rewards.

This presentation will discuss how the SRE team came about at Acquia, what we have achieved so far, and the lessons we have learned along the way. We will then walk through the steps for introducing SRE to your workplace so you can deliver more reliable and scalable services to your customers! Specifically, we will cover:

  • SRE's basic concepts and history from Google
  • The management support you will need to get started
  • Introducing the idea of service level objectives and error budgets
  • Operational Responsibility Assessments as a tool to measure risk
  • Creating a Launch Readiness Checklist to standardize and improve product launches
  • Finding ideal candidates for your SRE team

The intended audience is software engineers, system administrators, and managers who want to improve how they work and how their products and services perform.