Site reliability engineering (SRE) is a relatively new discipline that focuses on the high availability and reliability of production systems that are mission- and revenue-critical for an organization. This article is meant for developers, DevOps engineers, and engineering leaders who want to build highly available and reliable systems for their customers. In this article we will unearth the "Why?", "What?", and "How?" of implementing site reliability engineering for your organization's requirements.

Let us start with the "Why?" of site reliability engineering.

Imagine we are tasked with building an Amazon-like eCommerce platform. Let us assume we have built the platform and are now open for business. Initially, the platform will have a few hundred to a few thousand users, which will not put much load on the system. As the platform gains traction and grows from a hundred thousand to more than a million customers, the load on the system grows with it. Customer satisfaction is quintessential to retaining your growing or existing customer base. Meeting the SLA is essential to keeping customers happy: for example, keeping the platform up for X hours, responding to user queries within Y seconds, or completing an online sale within n seconds.
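To make the SLA factors above concrete, here is a minimal sketch in Python of how such targets might be checked against observed values. The names `SlaTargets`, `Observed`, and `sla_violations`, and all the numbers used, are hypothetical and chosen only for illustration; a real SLA would define its own metrics and measurement windows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlaTargets:
    """Hypothetical SLA targets (the X, Y, and n from the text)."""
    min_uptime_hours: float      # X: minimum hours the platform must be up
    max_response_seconds: float  # Y: maximum time to answer a user query
    max_checkout_seconds: float  # n: maximum time to complete an online sale


@dataclass
class Observed:
    """Values measured over the same period (illustrative only)."""
    uptime_hours: float
    response_seconds: float
    checkout_seconds: float


def sla_violations(targets: SlaTargets, observed: Observed) -> List[str]:
    """Return the names of the SLA factors that were not met."""
    violations = []
    if observed.uptime_hours < targets.min_uptime_hours:
        violations.append("uptime")
    if observed.response_seconds > targets.max_response_seconds:
        violations.append("response_time")
    if observed.checkout_seconds > targets.max_checkout_seconds:
        violations.append("checkout_time")
    return violations


# Example with made-up numbers: uptime and response time are fine,
# but completing a sale takes longer than the target allows.
targets = SlaTargets(min_uptime_hours=719.0,
                     max_response_seconds=2.0,
                     max_checkout_seconds=5.0)
observed = Observed(uptime_hours=720.0,
                    response_seconds=1.2,
                    checkout_seconds=6.5)
print(sla_violations(targets, observed))
```

Each factor is checked independently, so a platform can meet its uptime target and still miss the SLA on checkout time.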
