In this post I’ll share how our team measures the performance and availability of the Person API using Service Level Objectives (SLOs). Our team uses SLOs as a metric to know if there are any operational issues with the Person API as well as a tool for prioritizing new work.
SLOs are built on Service Level Indicators (SLIs). An SLI is a quantitative measurement of some aspect of a service, whether that service is an API, a web application, network storage, or the network itself. SLIs should be meaningful to the user of the service: a measurement that directly correlates with user happiness. For the Person API, we chose two SLIs: error rate and response time. These are common metrics for an API and are easy to measure. An SLI for a web application might be more specific to the application, such as how long it takes a user to complete a specific action.
In contrast, avoid SLIs that don't directly affect the user, such as the uptime of an individual server or its memory, disk, or CPU usage. High values on these metrics may correlate with a poor user experience, but usually only indirectly, by way of higher latency or increased errors. Measure those SLIs instead.
For the Person API, our error rate SLI measures the percentage of requests that respond with a status code in the 500 range, indicating a server error. We intentionally exclude 400-range status codes, since those indicate a problem with the request rather than a problem with the API; a good SLI removes client-side variability from the equation. For our response time SLI, we measure how long the Person API takes to respond to a GET request. Ideally this would be measured client-side, but we can only measure it server-side. To compensate, we take the measurement very late in the request-handling process to get the most accurate metric possible.
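To make these two SLIs concrete, here is a minimal sketch of how they could be computed from a batch of request records. This is illustrative only, not our production pipeline, and the `RequestRecord` type is a hypothetical stand-in for whatever your metrics system actually captures per request:

```python
from dataclasses import dataclass

# Hypothetical record type: a stand-in for whatever your metrics
# pipeline captures for each request.
@dataclass
class RequestRecord:
    status_code: int   # HTTP status the API returned
    latency_ms: float  # server-side time to respond, in milliseconds

def error_rate(records: list[RequestRecord]) -> float:
    """Fraction of requests that returned a 5xx status.

    4xx responses deliberately count as successes: they indicate a
    problem with the request, not with the API.
    """
    if not records:
        return 0.0
    errors = sum(1 for r in records if 500 <= r.status_code < 600)
    return errors / len(records)

def slow_rate(records: list[RequestRecord], threshold_ms: float = 1000.0) -> float:
    """Fraction of requests slower than the given threshold."""
    if not records:
        return 0.0
    slow = sum(1 for r in records if r.latency_ms > threshold_ms)
    return slow / len(records)
```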
Next, each SLI is expressed as a goal to form the SLO: a target for the service, stated in terms of the SLI, usually as an upper or lower limit. For the Person API, our error rate SLO is that at least 99.99% of requests will respond with a non-500 status code. For response time, at least 99% of requests will respond within 1 second (1,000 ms). The response time SLO pairs a percentage of total requests with the upper limit so that a handful of anomalous requests exceeding the 1-second threshold doesn't break the SLO.
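Written out against the hypothetical sketch above, checking a window of traffic against both targets is just a pair of comparisons:

```python
ERROR_RATE_SLO = 0.0001  # 99.99% of requests must avoid 5xx errors
SLOW_RATE_SLO = 0.01     # 99% of requests must finish within 1000 ms

def meets_slos(records: list[RequestRecord]) -> bool:
    """True if this window of traffic satisfies both documented SLOs."""
    return (error_rate(records) <= ERROR_RATE_SLO
            and slow_rate(records, threshold_ms=1000.0) <= SLOW_RATE_SLO)
```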
We picked these SLOs in the early development stages of the Person API to have a good starting point. It's important to create and document SLOs early in the development of a new service, even if the numbers feel like a blind guess. Regularly check in with users to make sure the SLOs align with their objectives; perhaps users require a faster response time than was initially proposed, and knowing that tells you where to focus development effort.
Documenting SLOs, especially for something like the Person API, also helps prospective users decide whether the service will meet their needs. For example, maybe the Person API has the data someone needs but doesn't respond fast enough. Knowing this ahead of time lets them confirm the Person API is the right fit, rather than discovering a mismatch through real-world usage after an integration has already been built.
It’s also important to choose a realistic SLO. Achieving a 100% success rate for API requests is impossible, and even getting close to 100% requires extensive time and resources, usually at the cost of fewer new features. For the Person API, new features mean new data, new endpoints, or new functionality such as webhooks or bulk exports. Conversely, we could pour all of our time and effort into new features, but that could come at the cost of lower performance and increased error rates, which would erode trust and confidence in the Person API.
This brings us to how we use SLOs for the Person API: monitoring and prioritization. For monitoring, we have alerts that notify our team if we are breaking our documented SLOs. A breach could indicate an operational issue with the Person API, or it could be the cumulative effect of many small changes that have, for example, slowly increased our response time until it regularly breaks the SLO. We use this data to prioritize our work.
If we are getting close to breaking (or have broken) our SLOs, we prioritize work that will improve them, such as adding capacity, refactoring code, or fixing bugs. If we have been regularly exceeding our SLOs (i.e., fast response times and low error rates), we know we can safely work on new features without significantly impacting our performance goals. A good SLO hits the sweet spot between a never-fails service and one that can quickly change and adapt to the needs of its users and organization.
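One common way to operationalize this trade-off is an error budget: the SLO implies a fixed allowance of bad requests per window, and how much of that allowance remains signals whether to prioritize reliability work or new features. The post doesn't describe our internal tooling, so the following is a generic sketch continuing the example above:

```python
def remaining_error_budget(records: list[RequestRecord],
                           slo: float = ERROR_RATE_SLO) -> float:
    """Fraction of the error budget left for this window of traffic.

    1.0 means no 5xx errors at all, 0.0 means the whole budget is
    spent, and a negative value means the SLO is already broken.
    """
    return (slo - error_rate(records)) / slo

# A possible policy: below some threshold (say 0.25 of the budget
# remaining), shift effort toward capacity, refactoring, and bug
# fixes; comfortably above it, new features are safe to ship.
```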
Lastly, regularly perform load testing to gauge the impact a change will have on your service. For the Person API, we run load tests in our development environments for every change we make. This doesn't give us an absolute measure of how the change will perform in production, but it does give us a relative measurement, so we can judge whether a change will have any impact on our SLOs once we push it to production.
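The post doesn't name the load testing tool we use, so purely as an illustration, a bare-bones load test that produces a p99 latency you can compare between a candidate build and the current baseline might look like this:

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def timed_get(url: str) -> float:
    """Issue one GET request and return its latency in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

def load_test(url: str, requests: int = 500, concurrency: int = 20) -> float:
    """Return the p99 latency (ms) over a burst of concurrent GETs."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_get, [url] * requests))
    # quantiles(..., n=100) yields 99 cut points; index 98 is the p99
    return statistics.quantiles(latencies, n=100)[98]

# Run the same test against the baseline and the candidate build in
# the same environment; the comparison is relative, not a prediction
# of production performance.
```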
For more information on the SLOs for the Person API, see our Guidelines documentation. For more information about SLIs and SLOs, I recommend reading Google’s book on Site Reliability Engineering (free). If you have any other questions about how we use SLOs for the Person API, feel free to contact our team: api@doit.wisc.edu
Jared Kosanovic – Enterprise Integration Architect
Division of Information Technology