Anyone who has ever worked in the IT or ICT industry knows about, or has at least heard of, SLAs: Service Level Agreements. A service level agreement is a commitment between a service provider and a client. Particular aspects of the service (quality, availability, responsiveness) are agreed between the service provider and the service user. The user could be either an external customer or another internal team in the same organisation. Formal SLAs with penalties are more common for external customers, but internal users also need to know what level of service they can expect from any service provided by other teams.
Service provider teams must set clear Service Level Objectives (SLOs) before they can commit to SLAs based on them. An SLO is a target value or range of values for a service level that is measured by a Service Level Indicator (SLI). An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. Some common examples of SLIs are ‘request latency’, ‘error rate’ and ‘system throughput’.
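To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. The `Request` record, the 300 ms threshold and the 99% target are all assumptions made up for illustration; a real setup would pull these measurements from a monitoring system rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def error_rate(requests):
    """SLI: fraction of requests that failed."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests served within threshold_ms."""
    return sum(1 for r in requests if r.latency_ms <= threshold_ms) / len(requests)


def meets_slo(sli_value, target):
    """SLO check: the measured SLI must reach the target value."""
    return sli_value >= target


# Illustrative sample data, not real measurements.
requests = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=450, ok=True),
    Request(latency_ms=95, ok=False),
]

fast_fraction = latency_sli(requests, threshold_ms=300)  # 3 of 4 requests
failed_fraction = error_rate(requests)                   # 1 of 4 requests

# Assumed SLO: 99% of requests are served within 300 ms.
slo_met = meets_slo(fast_fraction, target=0.99)
print(f"fast: {fast_fraction:.2f}, failed: {failed_fraction:.2f}, SLO met: {slo_met}")
```

The SLI is only the measurement (`fast_fraction`); the SLO is the target it is compared against (`0.99`).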
Setting the right SLO can be complicated. While choosing a specific SLI or set of SLIs might be easy, deciding which one, or which combination, is the best basis for the SLO is tricky. Remember that we set an SLO to be able to commit to a certain level of service, and the final judgement rests with the user of the service. No matter how we choose SLIs for the SLO, if the user is not getting what is expected, the SLO is useless. So we need to look at SLOs from the user’s point of view.
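One way to picture an SLO built from a combination of SLIs is as a list of targets that must all hold at once. The sketch below is only an illustration under assumed names and numbers (`Target`, `evaluate`, the availability and latency figures are hypothetical); it shows how a service can look healthy on one indicator while still missing what the user expects on another.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical pairing of one measured SLI with its SLO limit."""
    name: str
    measured: float
    limit: float
    higher_is_better: bool

    def met(self):
        # Availability should stay above its limit, latency below it.
        if self.higher_is_better:
            return self.measured >= self.limit
        return self.measured <= self.limit


def evaluate(targets):
    """Return (all targets met, names of missed targets)."""
    missed = [t.name for t in targets if not t.met()]
    return (not missed, missed)


# Illustrative values, not real measurements.
targets = [
    Target("availability", measured=0.9995, limit=0.999, higher_is_better=True),
    Target("p95_latency_ms", measured=420.0, limit=300.0, higher_is_better=False),
]

ok, missed = evaluate(targets)
print(ok, missed)
```

Here availability is within its target, yet the combined SLO is missed because users are waiting longer than agreed.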
SLAs are tied to business goals, so DevOps teams are normally not responsible for providing them. However, SLOs are needed to provide SLAs, and SLOs are based on SLIs, which fall under the Monitoring and Alerting tasks of DevOps. As a result, DevOps teams normally get involved in helping to avoid the consequences of missed SLOs. They can also help to define the SLIs: there obviously needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise.
But what about internal services? DevOps teams have to make sure they set accurate expectations for internal teams about the services they provide, by defining correct SLOs and by helping other service provider teams to do the same. After all, they are considered the Monitoring and Alerting experts unless there is a specialist team for Monitoring and Alerting.
There are many resources on how to choose the best SLIs for different SLOs, so I will not go into the details of setting SLIs for any specific SLO. Just don’t forget to define the SLO as close as possible to the user’s perspective.