Performance testing is something our API management and SOA governance customers often ask about. We have encountered situations where what constituted good performance was unclear at the start of a test effort.
Benchmarking Web services usually involves simulating many users sending many messages to mimic a heavy production load. Because we're often called upon to simulate production environments using our gateway, we frequently become central to discussions around performance. Equally important is understanding which statistics are available to measure, and how they relate to one another.
We have identified the key metrics that various organizations report as their most important statistics. The reasons for choosing one or another vary, but follow a short list of themes. Some organizations treat a single metric as the key performance indicator. We've encountered several in our engagements, and the following list stands out:
Requests Per Second
- In terms of overall statistics, reporting systems often cover the number of requests served in a given month, or the worst-case burst of request traffic. This implies metrics based on requests per unit of time, and the discussion usually becomes one of throughput measured in requests per second.
- Often requests per second are limited by networking issues:
- Network latency for very small messages can be a significant part of the whole request time overhead. This limits the number of new requests that can be accepted by the application stack in a given time period.
- Network Bandwidth for large messages can limit requests per second as you just can't put any more traffic on the network interface. Modern hardware can easily swamp a Gigabit network with large messages.
- Some operations are time consuming and necessarily synchronous, like certain kinds of LDAP lookups or database queries. Often the only way to optimize these is to redesign your workflow to cache results or otherwise reduce these wait cycles.
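To make the bandwidth limit above concrete, here is a minimal sketch of how raw link capacity caps requests per second. The link speed and message size are illustrative assumptions, not measured values:

```python
# Rough upper bound on requests/s when network bandwidth is the only
# constraint. Figures are hypothetical; real links lose some capacity
# to protocol overhead.

def max_rps_from_bandwidth(link_gbps: float, avg_message_bytes: int) -> float:
    """Theoretical ceiling on requests/s for a given link and message size."""
    link_bytes_per_sec = link_gbps * 1e9 / 8  # convert bits/s to bytes/s
    return link_bytes_per_sec / avg_message_bytes

# A 1 Gbit link carrying 1 MB messages tops out around 125 req/s,
# which is how large messages can swamp a Gigabit network.
print(max_rps_from_bandwidth(1.0, 1_000_000))  # 125.0
```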
Bytes Per Second
- We do encounter proposed benchmarks measuring throughput in bytes per second. For a variety of reasons these prove difficult to plan well. To accurately model a proposed application you would need to know:
- Average incoming message size
- Average back-end response time
- Maximum concurrency of back-end systems
- Bottlenecks at authorization and authentication systems
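The inputs listed above combine in a simple way. As a sketch, the back end's maximum concurrency and its average response time bound sustainable throughput (Little's law), and multiplying by average message size gives the bytes-per-second figure. All the constants here are hypothetical:

```python
# Sketch: why bytes/s targets need the inputs listed above.
# All parameter values are illustrative assumptions.

def sustainable_rps(backend_concurrency: int, backend_response_sec: float) -> float:
    """Little's law: max requests/s the back end can retire."""
    return backend_concurrency / backend_response_sec

def bytes_per_second(rps: float, avg_message_bytes: int) -> float:
    """Incoming byte rate implied by a request rate and message size."""
    return rps * avg_message_bytes

# e.g. 100 concurrent back-end slots, 250 ms average response time:
rps = sustainable_rps(backend_concurrency=100, backend_response_sec=0.250)
print(rps)                            # 400.0 requests per second
print(bytes_per_second(rps, 50_000))  # 20,000,000 bytes/s with 50 KB messages
```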
Latency
- Measuring the total elapsed wall-clock time it takes for a request to be serviced is critical in certain types of applications. We encounter this fairly often, and it has its own list of criteria that need to be taken into account.
- Usability of user interfaces is often enhanced by faster response times.
- Latency becomes a performance benchmark especially in chatty applications that use a large number of requests to service a single user action.
- We're often tasked with measuring this and our dashboard UI features instrumentation to show the separate components of request latency.
- It's important to note that latency and concurrency are often in opposition when building test cases.
- External decision points like LDAP and Single Sign-On systems often contribute latency to the whole application.
Concurrency
- Concurrency, as we define it, is the number of requests in flight being simultaneously processed at a given time. A request can be in several states: the initial TCP connection phases, request message processing in the gateway, request servicing by a back-end system, and the sending of the response back to the originating system.
- Once a message reply is sent back to the requesting system, the gateway resources are freed to process other requests.
- Web services usually don't keep session information in the infrastructure, so each request is self-contained. By definition, Service Oriented Architecture is granular at the message level, not at the live-connection level of a Client/Server architecture.
- Concurrency is usually the most misunderstood statistic in any performance discussion. This is covered in detail in the next section.
Number of users supported
- This is rarely encountered. We have seen it used interchangeably with maximum concurrency, though the two mean quite different things.
- Often it means that the concepts of the application and the application firewall have become somewhat intertwined.
Concurrency, Number of Users and Latency
In this section I present a way to interrelate users, concurrency and latency.
Planning for a good user experience and sizing your enterprise solution is a complex undertaking as there are at least five different parameters as inputs and quite a few ways of looking at the problem. Doing the math here is important to understand the issues.
One of the most common assumptions in sizing is that large concurrency is required to support a large number of simultaneous users interacting with the application. The usual mandate is to support your user base, and to plan to accommodate a worst-case situation, so let's see what real concurrency is needed by a large number of users.
The User Base
Let's assume 20,000 users. This analysis assumes a somewhat casual user base, because an enterprise that has 20,000 users on a single line-of-business application demands quite a bit more analysis than a short paper can provide.
Let's further assume the application is web based, but has a core component that is sourced from some services component, i.e. the portal model. Most of the HTTP requests for a given page are things like images, CSS and other small static files; these are serviced by web servers, not application servers, and so don't figure into this analysis. The calculations I present here are also applicable to fat-client GUI-style applications, because the same technology choices around minimizing server round trips to heavyweight services hold true for GUI applications.
We think it is important to digress into application design for a moment. Designing an application to do live queries for small parts of user interface content is not good practice, whether it's client/server, a web app, or a fat client. Waiting for even local network latency to fill in the content of UI elements like drop-down lists adds extra waiting states while painting a display. This makes the UI appear unresponsive, and makes a large client-base roll-out impractical for even unsecured applications, just from sheer request volume.
We are making a best practices assumption that services applications are designed to do one or two larger critical path requests as the core of the application service. We'll assume that most pages have a single type of information the user wants to view, but some will be more complex. I've chosen to use an average of 1.25 service requests per page view, reflecting a mix of page types.
Let's talk about what typical service latency means. Static content requests that require no processing should be sub-millisecond in latency, but actual service requests normally take 10 to 5000 milliseconds on the back end. Later I'll show how service request latency is a hugely important number in determining required concurrency.
So far we have 20,000 users, with 1.25 service requests per page, and each of those requests taking from 10 to 5000 milliseconds to process.
Requests Per Second
Next we need to determine how many requests those users will generate.
Given the way that people read and use applications, the bare minimum time it takes to recognize a fully rendered page or UI, find the content you are looking for, then choose a navigation element to initiate another request is likely 3 to 5 seconds. That's the bare minimum. I'm calling this time that users are not generating new requests to back end services the page dwell time.
Dwell time on a page of something like an invoice, a purchase order or a line of business task like a shipping request is going to be longer than 5 seconds.
So given a page dwell time between 5 and 60 seconds, over the course of an hour, 20,000 users are going to generate between roughly 1.5 and 18 million requests, or between about 417 and 5000 requests per second. This is a reasonable range for the overall requests per second statistic, but it leads us into the discussion of needed concurrency and how latency is by far the critical statistic.
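Recomputing the request load directly from the stated assumptions (20,000 users, 1.25 service requests per page, 5 to 60 second dwell times) can be sketched as:

```python
# Request load implied by the stated assumptions.
USERS = 20_000
REQUESTS_PER_PAGE = 1.25

def requests_per_second(dwell_sec: float) -> float:
    """Average req/s when each user views a new page every dwell_sec seconds."""
    return USERS * REQUESTS_PER_PAGE / dwell_sec

for dwell in (5, 60):
    rps = requests_per_second(dwell)
    print(f"dwell {dwell:>2}s: {rps:7.0f} req/s, {rps * 3600 / 1e6:.2f}M req/hour")
# dwell  5s:    5000 req/s, 18.00M req/hour
# dwell 60s:     417 req/s, 1.50M req/hour
```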
The calculation for the required concurrency is as follows: 20,000 users generating 1.25 service requests per page every 5 seconds would generate, on average, 20,000 × 1.25 / 5 = 5000 requests per second, or 300,000 requests per minute. We need to retire 5000 requests every second, and the service takes 10 milliseconds to retire a single request. In one second there are 100 periods of 10 milliseconds, so in each of these 10-millisecond periods we need to retire 5000/100 = 50 simultaneous requests.
Required Concurrency = Requests per Second / (1 / Latency in Seconds) = Requests per Second × Latency in Seconds
In our first example, 5000/(1/0.010) = 5000/100 = 50. Worth emphasizing here is the effect of latency on concurrency. Keeping minimum dwell at 5 seconds, and moving from 10 to 5000 milliseconds of service latency, the concurrency requirement goes from only 50 to service 20,000 users all the way to 16,667. At that point performance of the system is at a crawl, because at 1.25 requests per page, it would take an average of 7.5 seconds just to make the data available to render the page.
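The formula above can be sketched as a short calculation. Note this simple form assumes the arrival rate holds constant at 5000 requests per second; at very long latencies users self-throttle and the real figure is lower, as discussed in the text:

```python
# Required concurrency = requests per second x latency in seconds,
# per the formula above. The 5000 req/s rate is from the 20,000-user,
# 5-second-dwell example; the latency sweep shows how dominant latency is.

def required_concurrency(rps: float, latency_sec: float) -> float:
    """Average number of requests in flight at a steady arrival rate."""
    return rps * latency_sec

RPS = 5000

for latency_ms in (10, 100, 1000):
    c = required_concurrency(RPS, latency_ms / 1000)
    print(f"{latency_ms:>4} ms service latency -> {c:5.0f} concurrent requests")
#   10 ms service latency ->    50 concurrent requests
#  100 ms service latency ->   500 concurrent requests
# 1000 ms service latency ->  5000 concurrent requests
```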
There are a large number of simplifications in this calculation but it does demonstrate that characterizing the load and the user experience has a huge impact on a prediction of required concurrency. How long will your users wait for data before they decide the system is too slow?
Requests per user action has a direct relationship to concurrency as well. Less clear is the effect of page dwell time. These worst case numbers reflect a given user, on average, asking for new content every 5 seconds. That's kind of fast for most pages, unless you've built the system with lots of paging through content. Then when what they need is on page 3, they won't wait 5 seconds to ask for new content. This can create a worst case scenario unintentionally as user acceptance testing may not accurately reflect how often people generate new requests, because the UAT environments often are not loaded with enough data to require paging through content.
Latency is Key
Needed concurrency is directly proportional to latency
In the discussion of concurrency I presented a look at application analysis with total application service latency as a huge determining factor in concurrency requirements. There are many contributors to latency, and our gateway product is often a focal point for analysis of latency.
The above sequence diagram describes the processing steps and messages, internal lookup requests and points of latency when servicing a single inbound request at the Gateway.
The CA API Gateway Manager Dashboard specifically reports the time between steps 1 and 12 as the Front End Response time and the time between 8 and 10 as the Back End response time.
Experience has shown us that those are the two most important items to report when measuring latency.
Of note in this example is that the maximum front end response time, or more accurately the latency experienced by the end user, was only 132 milliseconds, even though the back end response time was 100 milliseconds.
In almost all scenarios we've encountered in the field, Step 9, the back end processing time is the bulk of the latency. This is beyond our control, but we do our best to help here: an efficient requester subsystem, controls on concurrency and connection caching for SSL.
There are some components of overall latency that we end up classifying as "our local processing overhead". One of them, Step 4, LDAP Lookup Time, is minimized somewhat by our authentication cache, but can still be a limiting factor. This call to LDAP has analogs in Single Sign-On authorizations and other external decision point references. This latency is not separately described in our UI, and may in some cases result in the gateway itself being suspected as a source of latency.
Also of particular impact is cryptography. Cryptographic operations can incur latency and/or heavy CPU usage depending on whether an internal HSM, internal software cryptography, or an external HSM solution is used. We have very efficient cryptographic capabilities, but there is a mathematical complexity associated with public key operations that no system can avoid.
With back end latency so dominating normal performance testing, we optimized our systems to minimize delay in back end processing. Our simplest case with small messages has us processing 20,000 requests per second with latency in the sub-millisecond range, so in most cases the gateway does not contribute any significant amount to latency.
Some policy elements have associated latency and can be avoided in latency-sensitive applications. Auditing is the obvious one, as it synchronously waits for the auditing subsystem to write to hard disk. CA has identified some usage patterns that add undue overhead to requests, and we can help you with those situations; just ask.
Latency in the whole transaction is one of the most important determining factors in sizing services deployments. The CA API Management Suite product line is not a large contributor to latency but can be used to measure and in some cases help alleviate these issues. We're happy to help you analyze your prospective workload and help plan for sizing your installation.