
Setting SLAs


Aug 02, 2012 03:00 PM

Stuart Weenig

Setting Performance SLAs in CA NetQoS ADA (SuperAgent)
TL;DR: see the SLA Methodology summary in Appendix I.

Abstract

This document describes the best practice for setting Service-level Agreements (SLAs) in SuperAgent. It includes information about how SuperAgent calculates SLAs, what is gained by using them, best practice considerations, and how to set SLA values when your enterprise does not have specific values in mind. It is intended for engineers and SuperAgent administrators looking to find value and guidance in setting SLAs.

Introduction to Performance SLAs

SLAs are a way to quantify current performance and performance trends in a report that can be presented to a broad audience. SLAs show the percentage of transactions that are faster than a given threshold. For example, an SLA might state that 90 percent of server response times must be under 20 milliseconds.

The result of an SLA is displayed in a report, which indicates whether the SLA has been met.
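
As an illustration of the definition above, the following minimal Python sketch computes the percentage of observations faster than a threshold and checks it against a target. The function names and sample values are hypothetical and are not part of SuperAgent; the sketch only mirrors the "90 percent under 20 milliseconds" example.

```python
def sla_compliance(response_times_ms, threshold_ms):
    """Return the percentage of observations faster than threshold_ms."""
    if not response_times_ms:
        raise ValueError("need at least one observation")
    passed = sum(1 for t in response_times_ms if t < threshold_ms)
    return 100.0 * passed / len(response_times_ms)


def sla_met(response_times_ms, threshold_ms, target_percent):
    """True when the observed compliance reaches the target percentage."""
    return sla_compliance(response_times_ms, threshold_ms) >= target_percent


# Hypothetical server response times in milliseconds (illustrative only).
samples = [5, 8, 12, 9, 15, 7, 11, 30, 6, 10]

print(sla_compliance(samples, 20))  # 9 of 10 samples are under 20 ms
print(sla_met(samples, 20, 90))
```

Here nine of the ten samples fall under 20 ms, so a 90 percent SLA at a 20 ms threshold is just met.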

SLA reports provide a tangible metric to give management that shows how well the network/servers/applications are performing as opposed to only monitoring performance based on the number of outages or complaints received.
SLAs also add to SuperAgent’s reporting capabilities by tracking the behavior of the worst performing transactions over time. This tracking indicates where and when performance degradations are most serious. This tracking also enables users to understand how variable performance is from the mean data points collected in SuperAgent.

SLA Calculations

SuperAgent does not collect SLA data in the same manner as its other data. The rest of SuperAgent’s data is recorded as the average performance for a given metric/application/server/network over a 5-minute period. For SLAs, SuperAgent instead compares every transaction against the SLA threshold and records, each hour, how many transactions passed and how many failed. As an analogy, instead of recording the average speed of cars on an interstate every five minutes, SuperAgent marks whether each car is traveling above 40mph and reports the successes and failures each hour.
You generally configure SLAs at a level of 90 percent or greater, which means that when you configure SLAs properly, they offer additional insight into the variability of performance. SuperAgent will then have historical trend data on the 5-minute averages and on the worst performing (highest percentile) transactions that users are experiencing.
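
The difference between the two recording styles can be sketched in Python as follows. This is only a conceptual illustration under assumed data shapes (a list of `(timestamp_seconds, latency_ms)` pairs); the function names are hypothetical and do not reflect how SuperAgent is implemented internally.

```python
from collections import defaultdict


def five_minute_averages(transactions):
    """Average latency per 5-minute bucket, keyed by bucket start time."""
    buckets = defaultdict(list)
    for ts, latency in transactions:
        buckets[ts // 300].append(latency)
    return {b * 300: sum(v) / len(v) for b, v in sorted(buckets.items())}


def hourly_sla_counts(transactions, threshold_ms):
    """(passed, failed) counts per hour, keyed by hour start time."""
    counts = defaultdict(lambda: [0, 0])
    for ts, latency in transactions:
        hour = ts // 3600
        if latency < threshold_ms:
            counts[hour][0] += 1  # faster than the threshold: pass
        else:
            counts[hour][1] += 1  # at or above the threshold: fail
    return {h * 3600: tuple(c) for h, c in sorted(counts.items())}


# Hypothetical transactions: (seconds since start, latency in ms).
transactions = [(0, 10), (60, 12), (120, 200), (400, 11), (3700, 9)]

print(five_minute_averages(transactions))
print(hourly_sla_counts(transactions, 50))
```

In this made-up data, the single 200 ms transaction is blended into one 5-minute average, while the hourly SLA count records it explicitly as one failure out of four transactions in the first hour.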

Statistics of TCP Applications

Because SLAs give additional insight into the slowest performing transactions, it is important to understand what a typical time slice of performance looks like. Most people are familiar with a normal distribution or bell curve, and when talking in terms of percentiles these are usually the first graphs to come to mind. However, TCP transactions do not follow a normal distribution. Instead, there is a minimum latency that is due to distance for network round trip time (NRTT) and due to I/O for server response time (SRT). Ideally, you want to have as many transactions as close to this minimum as possible.

The following graph provides an example:

This chart shows an idealized version of what SuperAgent sees over a given time period for NRTT. When performance degrades, the entire graph could shift to the right, or the tail in the graph might extend. This situation is one where SLAs are useful. SLAs enable the user to specify a threshold of where they wish the 90th and 98th percentile (note that the user can specify any percentile) to be.

Reports then show what percentile those values actually reached:

This method of configuring SLAs enables users to monitor tail behavior and set SLAs in a goal-oriented manner.

Best Practice

General Considerations

You can configure SLAs for three different metrics: Network Round Trip Time, Server Response Time, and Total Transaction Time. The following tips are useful for each SLA metric:

Network Round Trip Time
- Configure by network type, especially for WAN links. Verify that the networks with similar latency are grouped together.
- Monitor business critical applications, especially for WAN links.
- Select applications that represent the largest (or most constant) amount of traffic to each site to get more observations and statistical significance.

Server Response Time
- Do not separate by network type. Servers should treat requests independently no matter which network they came from.
- Monitor the most influential tier in a multi-tiered application and keep in mind that front end server response time will include back end server requests.
- Monitor servers for business critical applications.

Total Transaction Time
- Best indicator of overall end user experience because it is dependent on Server Response Time, Network Round Trip Time, and Data Transfer Time.
- Configure on a per network type basis because it depends on network round trip time.
- Monitor business critical applications.

Other things to consider when setting SLAs are:
- Does your organization have specific SLAs already?
- For which metrics is your team responsible?
- Which metrics does your manager want to see?
- What areas are you attempting to improve?
- How often will the reports be generated?

Configuring SLAs from SuperAgent Historical Data

You can use SuperAgent’s historical data as a starting point for configuring new SLAs. To do this, perform the following steps:

1. Navigate to the Engineering Tab in SuperAgent, select the Application, Server, and/or Network, and view a monthly report.
2. Scroll to the graph of the metric to SLA and record the 90th percentile and Max value from the Statistics box.
a. Note that these are the 90th percentile and max value of the 6 hour data points being displayed on a monthly graph.
3. Set the 90th percentile threshold to double the 90th percentile from the historical data.
4. Set the 98th percentile threshold to double the max value from the historical data.

5. Monitor the SLA as it runs over the first reporting period; don’t be discouraged if it does not meet SLA at first. Wait for a full reporting period to complete before tuning.
a. After a reporting period completes, tune the thresholds accordingly.
b. The results enable you to make a more educated assessment of the thresholds.
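
The starting thresholds from the steps above amount to simple arithmetic, sketched here in Python. The function name, the `factor` parameter, and the input values are hypothetical; the inputs stand in for the 90th percentile and Max values you would read from the Statistics box of a monthly Engineering report.

```python
def initial_thresholds(p90_ms, max_ms, factor=2.0):
    """Starting SLA thresholds: double the historical 90th percentile
    and double the historical max, per the steps above."""
    return {
        "p90_threshold_ms": factor * p90_ms,
        "p98_threshold_ms": factor * max_ms,
    }


# Hypothetical values read from a monthly report (illustrative only).
print(initial_thresholds(p90_ms=12, max_ms=40))
```

These values are only a starting point. Because the statistics come from 6 hour data points rather than individual transactions, expect to tune both thresholds after the first full reporting period.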

In this example, the results obtained show that the 88th percentile of NRTT is around 25ms and the 95th percentile is around 80ms. To tune these thresholds so that they would be more likely to meet the default SLA levels of 90% and 98%, we would need to raise both thresholds. The data collected gives an upper bound on where to move the first threshold. The results indicate that a threshold of around 80ms would give 95% success, which would be too high because it does not leave much room for improvement. Therefore, make the 90th percentile threshold less than 80ms and more than the 25ms threshold, which resulted in 88% success.
The 98th percentile threshold must be raised and monitored, as there is no guarantee how high the data above the 95th percentile goes. This threshold will have to be raised and tuned in order to get close to the 98th percentile.
Tuning results for this example:

90th percentile threshold = 60ms: This puts it much closer to the 81ms level that achieved 94.6 percent the previous month, which increases the chance of meeting the SLA without making it impossible to fail.
98th percentile threshold = 130ms: The original threshold of 81ms gave a result that was about 4% away from the desired 98% result. To get closer to 98% we will need to increase the threshold. We know that between the first two thresholds set there was a 56ms gap that represented about a 6% difference in percentiles. The intent is to increase the results by 4% to get above 98%, so we will add an amount less than 55ms, in this case 40ms, and then monitor the results.

When tuning SLAs initially, the goal should be to set thresholds at a value that puts results as close as possible to the percentile threshold. This will give a starting point from which network, server, and application performance can be rated. Each configuration change, upgrade, or time period can then be viewed from the perspective of: did this help or hinder meeting our SLAs?
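
The reasoning behind the 98th percentile increase can be approximated with a rough linear estimate, sketched below in Python. The function name is hypothetical, and the linear assumption is a simplification: it treats each additional percentage point as costing the same number of milliseconds, which a long-tailed distribution will generally not honor.

```python
def estimate_threshold_increase(gap_ms, gap_percent, needed_percent):
    """Linear estimate of how many ms to add to gain needed_percent,
    given that gap_ms previously bought gap_percent."""
    if gap_percent <= 0:
        raise ValueError("gap_percent must be positive")
    ms_per_percent = gap_ms / gap_percent
    return ms_per_percent * needed_percent


# Rounded figures from the example: a 56ms gap covered about 6 percentage
# points, and about 4 more points are needed to reach 98%.
increase = estimate_threshold_increase(gap_ms=56, gap_percent=6, needed_percent=4)
print(round(increase, 1))
```

The linear estimate comes out a little under the 40ms chosen in the example. Rounding up is reasonable because the tail usually stretches further per percentage point, and the result should in any case be confirmed by monitoring the next reporting period.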

Conclusion

Configuring SLAs gives you added insight into application performance by giving data on the variability of tail performance for an application. SLAs enable you to pinpoint where and when network, server, and application resources are degraded by enabling users to run reports showing when resources fail to meet SLA thresholds.

SLAs also allow the generation of easy to read reports that can be presented to a broad audience. These reports can be run as detailed or executive level from the Management tab of SuperAgent.

Appendix I

SLA Methodology

1. Generate an Engineering report for the desired time period (preferably 1 month).
a. For Network Round Trip Time or Total Transaction Time, drill into a specific network type.
2. Scroll to the graph of the metric to SLA and record the 90th percentile and Max value from the Statistics box.
a. Set the 90% threshold to double the 90th percentile from the historical data.
b. Set the 98% threshold to double the max value from the historical data.
3. Monitor the SLA as it runs over the first reporting period; don’t be discouraged if it does not meet SLA at first. Wait for a full reporting period to complete before tuning.
a. After a reporting period completes, tune the thresholds by using the resulting percentages as a guide.