Tracking Error Budget and SLO

Most organizations today track their product SLOs to avoid being liable for breaching SLAs (service level agreements). In case of an SLO violation, they are obligated to compensate the customer for the breach of contract. Once the SLO for a product has been defined, a corresponding error budget is calculated from that number. For example, if the SLO is 99.99%, the error budget is about 52.56 minutes per year ((1 - 0.9999) × 365 × 24 × 60 minutes). That is the amount of downtime the product may have in a year without breaching the SLO.

Once companies agree on the SLO, they need to pick the most relevant SLIs (service level indicators). Any violation of these SLIs is counted as downtime, and the duration of the downtime is deducted from the error budget. For example, a payment gateway product might track SLIs such as API availability, p99 request latency, and transaction error rate; a sketch of one such SLI is shown below.
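As an illustration, the error-rate SLI for such a product could be expressed as a Prometheus recording rule (the metric name, job label, and rule name here are placeholders, not taken from the project):

    groups:
      - name: payment-gateway-sli
        rules:
          # Ratio of 5xx responses to all responses over the last 5 minutes
          - record: sli:payment_request_error_ratio:rate5m
            expr: |
              sum(rate(http_requests_total{job="payment-api", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="payment-api"}[5m]))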

Additional reading:
https://sre.google/workbook/implementing-slos/
https://sre.google/workbook/error-budget-policy/

Why is it challenging for many companies to track error budgets at the moment?

Organizations usually use a mix of tools to monitor and track these SLIs (for example, latency-related SLIs are generally tracked in APMs such as New Relic, while other SLIs are tracked in monitoring tools such as Prometheus or Datadog). That makes it hard to keep track of the error budget in one centralized location.

Sometimes companies have a very short retention period (< 6 months) for their metrics in Prometheus. Retaining metrics for a longer period may require setting up Thanos or Cortex, writing federation rules, and doing capacity planning for metrics storage.

Next comes the problem of false positives. Even if you are tracking something in Prometheus, it's hard to flag an event as a false positive when the incident is not a genuine SLO violation. Building an efficient and battle-tested monitoring platform takes time, and initially teams may end up with a lot of false positives. You may also want to mark some old violations as false positives to get those minutes back into your error budget.

What does the SLO tracker do?

This error budget tracker seeks to provide a simple and effective way to keep track of the error budget and burn rate without the hassle of configuring and aggregating multiple data sources.

How to set this up?

Prometheus alerting rule to monitor an example SLI: Nginx p99 latency

  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
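
For this rule to fire, Prometheus needs to load the rule file and know where Alertmanager is running. A minimal prometheus.yml fragment might look like the following (the file path and Alertmanager address are assumptions; adjust them to your setup):

    # Load the rule file containing the NginxLatencyHigh rule above (example path)
    rule_files:
      - /etc/prometheus/rules/slo-alerts.yml

    # Point Prometheus at your Alertmanager instance
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']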

Alertmanager routing based on the severity label set in the alert rules

    global:
      resolve_timeout: 10m
    route:
      receiver: 'blackhole'        # default receiver; alerts that match nothing below go nowhere
      routes:
      - receiver: 'slo-tracker'    # forward critical alerts to the SLO tracker webhook
        group_wait: 10s
        match_re:
          severity: critical
        continue: true             # keep evaluating any further routes after this one
    receivers:
    - name: 'slo-tracker'
      webhook_configs:             # note: plural, and it takes a list
      - url: 'http://ENTERIP:8080/webhook/prometheus'
        send_resolved: true
    - name: 'blackhole'            # no notification config, so matched alerts are dropped

What’s next:

If you'd like to see the dashboard, please check this out!
(The credentials are admin:admin. Also, please use a laptop to open this web app; it's not yet mobile-friendly.)

This project is open source. Feel free to open a PR or raise an issue :)