Almost every tech company has customers, otherwise known as users, to answer to.
It doesn’t make a difference whether you’re running a small tech business or a large enterprise. In today’s digital world, there are specific standards and therefore specific expectations that must be met if you want to keep your organization up and running successfully.
That is why site reliability engineering (SRE) matters, along with the commitments that come with it.
Those commitments are better known as Service Level Agreements (SLAs) and their counterparts, Service Level Objectives (SLOs) and Service Level Indicators (SLIs). SLAs, SLOs, and SLIs set the standards for your service and define how those standards will be met.
The goal is to ensure that everyone, service providers and customers alike, remains on the same page at all times. SLAs, SLOs, and SLIs answer questions such as “What is your response time?” They also spell out how performance is measured and what happens when the provider fails to keep its promises.
However, each element has a different responsibility, which means they each face their own set of challenges in terms of keeping customers happy.
Here’s the breakdown of SLAs, SLOs, and SLIs—and their unique challenges:
Service Level Agreements (SLA)
An SLA is an agreement between provider and client that sets out each party’s responsibilities and measurable commitments, such as uptime and responsiveness, as a means to manage expectations.
As a rule, an SLA is only needed for ongoing paid services.
SLAs are typically drawn up as legal terms that represent the provider’s promise to its customers, as well as the consequences should the provider fall short of that promise. Those consequences can include financial penalties, license extensions, or service credits, to name a few.
The interesting thing about SLAs is that they’re rather difficult to report on, measure, and meet. The agreements are often not written by technical experts, so they tend to include demanding standards that don’t necessarily align with the evolving priorities of the tech industry as a whole.
For example, many SLAs promise that the provider’s tech teams can resolve certain product issues within 24 hours. However, the agreements don’t typically specify when that 24-hour clock starts ticking: when the customer first reaches out, or when the customer sends the information the IT team needs to identify the issue. This ambiguity has generated a lot of pushback from IT managers over the years.
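The ambiguity is easy to see with timestamps. The sketch below is illustrative only: the ticket timeline and the names `within_target`, `reported_at`, `info_complete_at`, and `resolved_at` are made up for this example. It measures the same ticket against a 24-hour target from both possible starting points.

```python
from datetime import datetime, timedelta

# Hypothetical 24-hour resolution target taken from an SLA.
RESOLUTION_TARGET = timedelta(hours=24)

def within_target(clock_start: datetime, resolved_at: datetime) -> bool:
    """Return True if the issue was resolved within the target, counted from clock_start."""
    return resolved_at - clock_start <= RESOLUTION_TARGET

# Made-up timeline for a single ticket.
reported_at = datetime(2024, 3, 4, 9, 0)        # customer first reaches out
info_complete_at = datetime(2024, 3, 4, 17, 0)  # customer sends the details needed to diagnose
resolved_at = datetime(2024, 3, 5, 13, 0)       # fix delivered

met_from_report = within_target(reported_at, resolved_at)      # 28 hours elapsed
met_from_info = within_target(info_complete_at, resolved_at)   # 20 hours elapsed
print(met_from_report, met_from_info)
```

With this timeline the same ticket misses the target under one reading and meets it under the other, which is why the starting point is worth stating explicitly in the agreement.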
For this reason, it’s important to involve technical experts when drafting SLAs and related legal documents, so that the agreements address real-life scenarios and set standards that can actually be met as guaranteed.
Service Level Objectives (SLO)
An SLO is a clause within an SLA that sets a specific target for a metric such as uptime or response time. SLOs are used to set customer expectations and to let IT and DevOps teams know the standards they need to meet.
Therefore, if the SLA is the official agreement between a service provider and customer, SLOs are the individual “promises” being made within that agreement.
However, unlike SLAs, SLOs can be used for both paid and free accounts, and for both internal and external customers. For example, internal systems such as CRMs, intranets, and client data repositories can benefit from SLOs, because they help internal teams achieve their customer-facing goals.
Of course, SLOs have their own challenges: they tend to be too vague, too complicated, or unmeasurable. Because of this, only the most important metrics should be covered by SLOs, and each one should be as simple and concise as possible and written in plain language, so that all parties can understand it.
SLOs should also account for the issues that come up in SLAs, such as client-side delays.
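One way to keep an SLO simple and measurable is to write it down as a handful of fields. The sketch below is only an illustration: the `SLO` class, its field names, and the example values are assumptions made here, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    """A single objective: one metric, one target, one measurement window."""
    name: str                      # plain-language description anyone can read
    metric: str                    # the indicator this objective is measured by
    target: float                  # required value, in the metric's own unit
    window_days: int               # period over which the target applies
    higher_is_better: bool = True  # False for metrics such as response time

    def is_met(self, measured: float) -> bool:
        """Compare a measured value against the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target

# Example values are hypothetical.
uptime_slo = SLO("Service is available", "uptime_percent", 99.98, 30)
latency_slo = SLO("Pages respond quickly", "p95_response_ms", 300.0, 30,
                  higher_is_better=False)

print(uptime_slo.is_met(99.99), latency_slo.is_met(450.0))
```

Keeping each objective to a single metric, target, and window makes it harder for an SLO to drift into something vague or unmeasurable.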
Service Level Indicators (SLI)
An SLI is the measurement used to check compliance with an SLO. Any company that plans to measure its performance against its SLOs absolutely needs SLIs in order to do so. It’s non-negotiable.
For example, if you have an SLA stating that your systems will have an uptime of 99.98%, then your SLO will also declare 99.98% uptime. Your SLI, however, is the uptime you actually measure, which may come in 0.01-0.03% over or under what’s stated in your SLA.
Therefore, your SLI must meet or exceed the target promised in the initial agreement for you to remain compliant with your SLA.
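To make the numbers concrete, the sketch below works out how much downtime a 99.98% target leaves in a given window and computes an uptime SLI from measured downtime. The 30-day window, the function names, and the 10 minutes of measured downtime are assumptions chosen purely for illustration.

```python
SLO_TARGET_PERCENT = 99.98      # uptime target from the example in the text
WINDOW_MINUTES = 30 * 24 * 60   # assumed 30-day window = 43,200 minutes

def allowed_downtime_minutes(target_percent: float, window_minutes: int) -> float:
    """Downtime the target tolerates over the window."""
    return window_minutes * (100.0 - target_percent) / 100.0

def uptime_sli_percent(downtime_minutes: float, window_minutes: int) -> float:
    """Measured uptime as a percentage of the window."""
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes

budget = allowed_downtime_minutes(SLO_TARGET_PERCENT, WINDOW_MINUTES)  # about 8.64 minutes
measured = uptime_sli_percent(10.0, WINDOW_MINUTES)                    # hypothetical 10 minutes down

print(f"allowed downtime: {budget:.2f} min")
print(f"measured uptime: {measured:.3f}%  meets SLO: {measured >= SLO_TARGET_PERCENT}")
```

Under these assumptions, ten minutes of downtime in a month is already enough to fall below a 99.98% target, which shows how little room such a promise leaves.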
Relationship between SLA, SLO, SLI
To provide services at a suitable level, you need to collect metrics (your SLIs), define thresholds for those metrics (your SLOs), and monitor those thresholds so that you don’t breach the SLA. In short, that means:
- SLIs are the metrics in the monitoring system;
- SLOs are the thresholds on those metrics, enforced through alerting rules;
- SLAs are the agreed values that the monitored metrics must not fall below.
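The three roles can be sketched in a few lines. This is a simplified illustration, not a real monitoring setup: the request counts, both thresholds, and the function names are hypothetical. The SLO threshold is set stricter than the SLA threshold here, as an assumption, so that an alert fires before the agreement itself is at risk.

```python
# SLI: a metric computed from monitoring data (here, the share of successful requests).
def availability_sli(successful: int, total: int) -> float:
    return 100.0 * successful / total if total else 100.0

SLO_THRESHOLD = 99.9   # hypothetical internal target; falling below it raises an alert
SLA_THRESHOLD = 99.5   # hypothetical contractual commitment; falling below it has consequences

def evaluate(successful: int, total: int) -> str:
    """Classify the current SLI against the SLO and SLA thresholds."""
    sli = availability_sli(successful, total)
    if sli < SLA_THRESHOLD:
        return "SLA breached"
    if sli < SLO_THRESHOLD:
        return "SLO alert"
    return "ok"

print(evaluate(99_950, 100_000))  # SLI = 99.95
print(evaluate(99_800, 100_000))  # SLI = 99.8
print(evaluate(99_000, 100_000))  # SLI = 99.0
```

The gap between the two thresholds is the warning margin: the team hears about a problem while there is still time to act before the customer-facing promise is broken.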
Like SLAs and SLOs, SLIs face their own challenges. Arguably the biggest challenge with an SLI is keeping it simple. To remain compliant with your SLA, you need to track the right metrics without complicating the job of your IT team by making them chase down metrics that don’t actually make a difference in your clients’ lives.
Together, your SLA, SLO, and SLI all rest on the assumption that the services you’re providing can’t be guaranteed 100% of the time, although 99.98% is pretty close. The point is that your SLA, SLO, and SLI need to work together rather than against each other, so that both your clients and your IT and DevOps teams have a clear set of expectations.