Tracking Error Budget and SLO

Most organizations today track their product SLOs to avoid being liable for breaching SLAs (service level agreements). In case of an SLO violation, they are obligated to compensate the customer for the breach of contract. Once the SLO for a product has been defined, a corresponding error budget is calculated from that number. For example, if the SLO is 99.99%, the error budget is about 52.56 minutes per year ((1 - 0.9999) × 365 × 24 × 60 minutes). That is the amount of downtime the product may have in a year without breaching the SLO.

Once companies agree on the SLO, they need to pick the most relevant SLIs (service level indicators). Any violation of these SLIs is counted as downtime, and the duration of the downtime is deducted from the error budget. For example, a payment gateway product might track SLIs such as API availability, p99 request latency, and transaction error rate; a sketch of one such SLI is shown below.
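As an illustration, the error-rate SLI for such a product could be expressed as a Prometheus recording rule (the metric name, job label, and rule name here are placeholders, not taken from the project):

    groups:
      - name: payment-gateway-sli
        rules:
          # Ratio of 5xx responses to all responses over the last 5 minutes
          - record: sli:payment_request_error_ratio:rate5m
            expr: |
              sum(rate(http_requests_total{job="payment-api", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="payment-api"}[5m]))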

Additional reading:
https://sre.google/workbook/implementing-slos/
https://sre.google/workbook/error-budget-policy/

Why is it challenging for many companies to track error budgets at the moment?

Organizations usually use a mix of tools to monitor and track these SLIs (for example, latency-related SLIs are generally tracked in APMs such as New Relic, while other SLIs are tracked in monitoring tools such as Prometheus or Datadog). That makes it hard to keep track of the error budget in one centralized location.

Sometimes companies have a very short retention period (< 6 months) for their metrics in Prometheus. Retaining metrics for a longer period may require setting up Thanos or Cortex, writing federation rules, and doing capacity planning for metrics storage.

Next comes the problem of false positives. Even if you are tracking something in Prometheus, it's hard to flag an event as a false positive when the incident is not a genuine SLO violation. Building an efficient and battle-tested monitoring platform takes time, and initially teams may end up with a lot of false positives. You may also want to mark some old violations as false positives to get those minutes back into your error budget.

What does the SLO tracker do?

This error budget tracker seeks to provide a simple and effective way to keep track of the error budget and burn rate without the hassle of configuring and aggregating multiple data sources.

How to set this up?

Prometheus alerting rule to monitor an example SLI: Nginx p99 latency

  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
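
For this rule to fire, Prometheus needs to load the rule file and know where Alertmanager is running. A minimal prometheus.yml fragment might look like the following (the file path and Alertmanager address are assumptions; adjust them to your setup):

    # Load the rule file containing the NginxLatencyHigh rule above (example path)
    rule_files:
      - /etc/prometheus/rules/slo-alerts.yml

    # Point Prometheus at your Alertmanager instance
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']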

Alertmanager routing based on the severity label set in the alert rules

    global:
      resolve_timeout: 10m
    route:
      receiver: 'blackhole'        # default receiver; alerts that match nothing below go nowhere
      routes:
      - receiver: 'slo-tracker'    # forward critical alerts to the SLO tracker webhook
        group_wait: 10s
        match_re:
          severity: critical
        continue: true             # keep evaluating any further routes after this one
    receivers:
    - name: 'slo-tracker'
      webhook_configs:             # note: plural, and it takes a list
      - url: 'http://ENTERIP:8080/webhook/prometheus'
        send_resolved: true
    - name: 'blackhole'            # no notification config, so matched alerts are dropped

What’s next:

If you'd like to see the dashboard, please check this out!
(The credentials are admin:admin. Also, please use a laptop to open this web app; it's not yet mobile-friendly.)

This project is open source. Feel free to open a PR or raise an issue :)