During my work I’m often asked what the SLA (Service Level Agreement) is for a given system. I’ve tried to summarize my reply in the following blog post.
The SLA way of thinking works badly with cloud services and the highly distributed systems we create these days.
The SLA of the individual services is insufficient because there is going to be solution-specific code in many places, which is often where problems get introduced.
If, for argument’s sake, you have a perfectly efficient, bug-free solution that will never exceed its scale targets, you will still have a problem. A solution built on multiple services (like most IoT solutions I work on) will experience failures. If we take the services in a given hot path and look at what their SLAs translate into as monthly downtime (in minutes), we get the following (the actual SLA numbers may not be correct, but that does not matter for the conclusion):
Service | SLA (%) | Monthly downtime (minutes) |
IoT Hub | 99.9 | 43.83 |
Event Hub | 99.9 | 43.83 |
Cloud Gateway/Service | 99.95 | 21.92 |
Azure Blob Storage | 99.9 | 43.83 |
Azure SQL DB | 99.9 | 43.83 |
Document DB | 99.95 | 21.92 |
Azure Table Storage | 99.9 | 43.83 |
Azure Stream Analytics | 99.9 | 43.83 |
Notification Hub | 99.9 | 43.83 |
Total | | 350.65 |
Worst case, the services fail independently and their outages never overlap, so the downtime is cumulative. That means the SLA-acceptable downtime is about 350 minutes, or almost 6 hours per month, which translates to a solution availability of roughly 99.2%. And this is just the “unplanned” downtime. It also assumes that you have no bugs or inefficient code in your solution.
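The arithmetic behind these figures is easy to reproduce. The following is a minimal Python sketch, assuming an average month of 43,830 minutes (365.25 days divided by 12) and reusing the illustrative SLA percentages from the table; the function names are my own and not part of any Azure SDK. Because it sums unrounded values, the total can differ from the table in the last digit.

```python
# Average month length in minutes: 365.25 days / 12 months (an assumption).
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830

# SLA percentages copied from the table (illustrative, not authoritative).
SLAS = {
    "IoT Hub": 99.9,
    "Event Hub": 99.9,
    "Cloud Gateway/Service": 99.95,
    "Azure Blob Storage": 99.9,
    "Azure SQL DB": 99.9,
    "Document DB": 99.95,
    "Azure Table Storage": 99.9,
    "Azure Stream Analytics": 99.9,
    "Notification Hub": 99.9,
}


def monthly_downtime_minutes(sla_percent: float) -> float:
    """Downtime per month that a single SLA still permits."""
    return (1 - sla_percent / 100) * MINUTES_PER_MONTH


def worst_case_downtime_minutes(slas: dict) -> float:
    """Worst case: outages never overlap, so permitted downtime adds up."""
    return sum(monthly_downtime_minutes(sla) for sla in slas.values())


def worst_case_availability_percent(slas: dict) -> float:
    """End-to-end availability implied by the summed downtime."""
    return 100 * (1 - worst_case_downtime_minutes(slas) / MINUTES_PER_MONTH)


if __name__ == "__main__":
    print(f"Worst-case downtime: {worst_case_downtime_minutes(SLAS):.2f} min/month")
    print(f"Worst-case availability: {worst_case_availability_percent(SLAS):.2f}%")
```

Adding or removing a service in `SLAS` shows how quickly the budget of acceptable downtime grows with each extra dependency in the hot path.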
Another issue with the traditional way of thinking is that failure to meet an SLA usually results in a service credit of some percentage. So, in effect, customers are really asking, “How likely is it that I’ll get a refund?” That is completely the wrong question to ask. An SLA doesn’t help build great solutions that meet users’ needs.
At this point, you may be thinking: this is all good, but the customer will still require some sort of guarantee as to the uptime. How do you answer that question?
Well, no one said it would be easy.
Usually I try to change the discussion:
- What if we had an SLA and we went down, and the millions of IoT devices could not check in for an hour? How much money and goodwill would that cost you? Would an SLA service credit for that hour make you whole?
- What if we had an SLA and were up, but you had a bug (or something else outside the scope of the SLA) that took you down end-to-end for an hour? An SLA would not help at all in that situation.
So that means we need to take a different approach.
Given that an SLA would not solve things, we need to move past it and talk about how to be successful.
So what is the desired approach?
1. Design the app/service to be resilient
2. Design the monitoring and operations tools and processes to be ready for new services
#1 is usually a focus, as most customer development teams already think this way. However, operations teams often do not.
#2 is usually the elephant in the room that blocks deployments, held up either by operations or by compliance because they still think the old way, in terms of SLAs.
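To make #1 a little more concrete: one common building block of a resilient service is retrying transient failures with exponential backoff and jitter, rather than assuming a dependency is always up. The following is only a sketch of that general technique in Python, not a prescription for any particular Azure SDK. `TransientError` and `call_with_retry` are hypothetical names, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5,
                    max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with backoff and jitter.

    The last failure is re-raised once max_attempts is exhausted, so the
    caller can fall back (queue the message, serve cached data, alert).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # The delay cap doubles on each attempt; the random jitter keeps
            # many clients from retrying at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

A caller would wrap each call to a dependency, for example `call_with_retry(lambda: send_to_hub(message))`, where `send_to_hub` stands in for whatever client call the solution actually makes. Retries only cover short blips; longer outages still need the monitoring and operations readiness from #2.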
Customers want an SLA because that is how they are used to thinking. If they really want to take full advantage of the cloud they need to move on.