User experience at peak usage is often awful. During heavy usage periods, the way customers experience a network-available resource, whether it is an API or a simple web page, can suffer badly enough to push them away from your product.
Plan to scale
When planning for scale, remember that every tool out there has capacity limits. Many can scale horizontally, and many can scale vertically, but ALL of them cost time, money, or both to scale.
Unbounded scaling is just not available for free. Even if the products in use attract no license fees, there are always costs: the actual machines, the CPUs, hosting, installation, complexity, build scaling, load distribution. Everything has a budget: there are only so many hours in a dev schedule and so many dollars in capital and ops budgets.
User and Budget friendliness
One way to look at budget-friendly scaling is to look at the consequence of not scaling: errors, and how they affect the user experience. Our experience is that modern users far prefer a fast error to a slow eventual timeout. This leads to a possible solution: what if we choose to send a quick error rather than let the request time out? What would that look like?
Concurrency is scaling
Possibly the most common way that systems are scaled is via some form of concurrency. Whether they are old-school monolithic servers or up-to-the-minute microservice containers, all of them eventually hit a limit on how many concurrent requests can be served before either the response time becomes unacceptably long or timeouts occur on the request side.
Concurrency and latency are key in a larger TPS conversation
Managing scaling must pay careful attention to the queueing equation:
Average Transactions Per Second = (Average Concurrency) / (Average Latency in Seconds)
In practical terms, if the average latency of your API responses is 5 seconds, you need 500 outstanding requests to service just 100 requests per second. That assumes the number of concurrent connections doesn’t also increase your average latency, which is rarely the case.
If you can somehow reduce your maximum and average latency, you can expect to have smaller concurrency and still get the transactions per second you need.
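The queueing equation above can be rearranged to see how much concurrency a target rate demands. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of any library.

```python
def required_concurrency(tps: float, latency_seconds: float) -> float:
    """Average in-flight requests needed to sustain `tps` at this average latency."""
    return tps * latency_seconds


def achievable_tps(concurrency: float, latency_seconds: float) -> float:
    """Average transactions per second that `concurrency` in-flight requests deliver."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrency / latency_seconds


# The example from the text: 100 requests per second at 5 seconds average latency.
print(required_concurrency(100, 5.0))   # 500.0
# Cutting average latency to 0.5 seconds needs a tenth of the concurrency.
print(required_concurrency(100, 0.5))   # 50.0
```

Both directions of the equation are averages, so they describe steady-state behavior rather than the worst moment of a traffic spike.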
Smaller concurrency is generally better for many reasons: it uses fewer network resources, it eases scaling of the external-facing parts of the infrastructure such as load balancers, and it reduces the back-end overhead of servicing so many concurrent requests.
High Concurrency can cause a runaway at the user experience level
Let me give a more concrete example. Assume a standard Java app server, which is nearly always configured with a maximum thread count. With a configured thread count of 100, if you present 101 concurrent requests, the 101st request MUST wait until one of the other threads is free, roughly doubling the latency for that request. If you present 200 concurrent requests, your average user-apparent latency increases by at least 50%, as half the requests are waiting on free threads.
If users also cancel and retry requests, each retry adds to the load and concurrency, making the latency worse.
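The thread-limit arithmetic can be checked with a small model. This sketch makes simplifying assumptions that are not in the text: all requests arrive at the same moment, every request takes the same service time, and the server works through them in waves of at most `threads` requests.

```python
def average_latency(requests: int, threads: int, service_seconds: float) -> float:
    """Average user-apparent latency when `requests` arrive at once and are
    served in waves of at most `threads`, each taking `service_seconds`."""
    total = 0.0
    for i in range(requests):
        wave = i // threads + 1          # 1 for the first `threads` requests, 2 for the next, ...
        total += wave * service_seconds  # a request in wave k finishes after k service times
    return total / requests


print(average_latency(100, threads=100, service_seconds=1.0))  # 1.0
print(average_latency(200, threads=100, service_seconds=1.0))  # 1.5
```

With 200 requests against 100 threads, half the requests see double the latency, which is the 50% average increase described above. Real traffic arrives unevenly, so treat this as a lower bound on the effect.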
We can make a simplifying assumption that you have an internal design that is compute-bound, meaning that the number of physical cores on the machine (let’s say 16) determines the concurrency at which the service latency starts to ramp up. With the 100-thread limit above and 200 inbound requests, 16 requests are running, 84 are waiting in the application for available CPU, and 100 more are waiting in the network stack and connector.
In reality, most services make some external calls, so they can usefully process more concurrent requests than there are cores. This only moves the latency further back into your infrastructure and does not make the problem disappear. Sometimes the limited resource is a database insert that is only as fast as the disk will go, and again, customer-visible service latency starts to ramp up.
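To see where requests pile up under the compute-bound assumption, here is a rough sketch using the example figures of 16 cores and a 100-thread limit. The function name and the three-way split are illustrative simplifications.

```python
def where_requests_wait(inbound: int, threads: int, cores: int) -> dict:
    """Split inbound requests into running, waiting for CPU, and waiting upstream."""
    in_app = min(inbound, threads)   # requests that were given an application thread
    running = min(in_app, cores)     # compute-bound: only one request per core makes progress
    return {
        "running_on_cpu": running,
        "waiting_for_cpu": in_app - running,
        "waiting_in_connector": inbound - in_app,
    }


print(where_requests_wait(100, threads=100, cores=16))
# {'running_on_cpu': 16, 'waiting_for_cpu': 84, 'waiting_in_connector': 0}
print(where_requests_wait(200, threads=100, cores=16))
# {'running_on_cpu': 16, 'waiting_for_cpu': 84, 'waiting_in_connector': 100}
```

Only the first number represents useful work; everything else is latency the user is already experiencing.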
Without concurrency limits, user satisfaction starts to really vary based on the time of day – during peak times, the latency can approach and surpass the threshold of customer patience, and then a cancel/re-request happens, making the problem worse.
Browser timeouts are insanely long; let’s talk about human timeouts.
The Firefox Web Browser standard request timeout for HTTP calls is 300 seconds. That’s well beyond anyone’s patience.
At 300 seconds, in some contexts, the requested information, even if it finally reaches the browser, may no longer be relevant, resulting in another request.
Timeouts are far sooner than you think.
If the user-apparent latency of an API call isn’t capped at some sane amount, the overall effect is to reduce the number of requests that complete successfully, at least from the user’s perspective. More people will give up and cancel around the 10 second mark, and your system will waste resources servicing requests that will never be delivered, which is still a failed request from the user’s perspective.
There’s real benefit beyond scaling: if a client-side request times out or is canceled by the user, the way load balancers, the public internet, and enterprise network infrastructure work means that the chances of the back-end server receiving any kind of actionable “Cancel” are effectively zero.
This means that waiting the standard 60 to 120 seconds becomes questionable, especially with the impatience of end users in mind.
Think of it this way: customers normally stop paying attention to the user interface somewhere between 1 and 10 seconds of waiting, according to Nielsen research going back to the 1960s. At 10 seconds, you’ve lost their attention and increased their frustration.
Because nobody pays attention long enough.
The common response from users nowadays is to wait a few seconds, lose patience, and then hit the cancel button, close the app, or worse, hit the refresh button. Because the UI has discarded the request context, your infrastructure will continue to process the previous request, which will never be received by the user, PLUS the new one.
Allowing anything more than 10 seconds of response time in general wastes many seconds of back-end processing, concurrency in middle tiers, and the user’s good will.
The sooner you respond to the user, the better
But of course, shorter is better. We’d suggest planning for sub-10-second interactions in general. If an operation is expected to take longer than 1 or 2 seconds, you should explore asynchronous requests: it’s relatively easy to accept a request, put it on an internal queue, and send some data back so the client can present some UI to check progress. This alone will virtually eliminate duplicate requests. See Asynchronous UIs - the future of web user interfaces for some thoughts on asynchronous UI. There's also a way to use API Gateways to create an asynchronous API from a synchronous one using a queue and a cache.
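As a rough illustration of the submit, queue, and check-progress pattern, here is a minimal in-process sketch. Everything in it is hypothetical: `submit`, `status`, the in-memory `jobs` dictionary (standing in for a cache), and the upper-casing "work" are placeholders, and a real deployment would put these behind HTTP endpoints with a durable queue.

```python
import queue
import threading
import uuid

jobs = {}                      # hypothetical status store; a shared cache in practice
work_queue = queue.Queue()     # hypothetical internal queue


def submit(payload: str) -> str:
    """Accept the request immediately and return a job id the client can poll."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    work_queue.put((job_id, payload))
    return job_id


def status(job_id: str) -> dict:
    """Cheap status check the UI can call to show progress."""
    return jobs.get(job_id, {"status": "unknown", "result": None})


def worker() -> None:
    """Drain the queue; the slow work happens here, off the request path."""
    while True:
        job_id, payload = work_queue.get()
        jobs[job_id]["status"] = "running"
        jobs[job_id] = {"status": "done", "result": payload.upper()}  # stand-in for real work
        work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

job = submit("report for account 42")
work_queue.join()              # demo only: wait until the worker has finished
print(status(job)["status"])   # done
```

Because the client holds a job id instead of an open connection, a refresh can re-check the existing job rather than start a duplicate.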
Even for Batch
Even between machines, for effectively batch-style operations, long latency still becomes questionable: it often implies that the workflow is stacking up many concurrent operations, causing systems to use their scheduling heavily, wasting time managing the process state.
It is better for many reasons not to risk a client-side timeout at all.
Timeout early, timeout often
Our experience is that timeouts in the client software are a failure case that we must avoid. Instead, we should plan for a client-visible error as a viable response by actively managing timeouts and concurrency.
Use infrastructure for UX
This means that API gateways and similar infrastructure tools can be used to improve your success rate with hard limits on latency and concurrency.
If you limit concurrency and maximum latency to a level your tool chain can handle, and reject requests above that with a “too busy” message (perhaps via HTTP 503, 504, or some other strong signal), then you can plan for a user experience that has fewer wasted requests and, on average, more successful client-visible requests.
This probably also means you can get a higher “customer perceived fully successful request” rate, even in high-volume, under-powered periods, and respond better under heavy load, because the infrastructure would prevent crash-prone overload conditions on back-end systems.
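The idea of hard limits with a fast “too busy” answer can be sketched in a few lines. This is an illustrative in-process gate, not how any particular API gateway is implemented; the `Gate` class, its parameters, and the message strings are assumptions made for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class Gate:
    """Hypothetical admission gate: caps in-flight requests and response time."""

    def __init__(self, max_concurrency: int, max_latency_seconds: float) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrency)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrency)
        self._max_latency = max_latency_seconds

    def _run(self, backend, args):
        try:
            return backend(*args)
        finally:
            # Free the slot only when the backend call really finishes,
            # even if the caller has already been sent a 504.
            self._slots.release()

    def handle(self, backend, *args):
        # Fast error: never queue behind a full backend, reject immediately.
        if not self._slots.acquire(blocking=False):
            return 503, "too busy, retry later"
        future = self._pool.submit(self._run, backend, args)
        try:
            return 200, future.result(timeout=self._max_latency)
        except FutureTimeout:
            return 504, "backend took too long"


gate = Gate(max_concurrency=1, max_latency_seconds=0.2)
first = gate.handle(lambda: "ok")       # fast backend call succeeds
slow = gate.handle(time.sleep, 0.6)     # exceeds the latency cap: 504 after 0.2 s
busy = gate.handle(lambda: "ok")        # slot still held by the slow call: instant 503
time.sleep(0.6)                         # let the slow call drain
recovered = gate.handle(lambda: "ok")   # capacity is back
print(first[0], slow[0], busy[0], recovered[0])
```

Note the design choice in `_run`: the slot stays occupied until the back-end work actually ends, because a timed-out request still consumes capacity even though nobody is waiting for it.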
Planned acceptable error rate
Assume you want to provide a given level of service, whether that is some number of concurrent users or a transactions-per-second rate. Using the above thinking, we can start creating a service level that is defined in terms of maximum latency and acceptable error responses.
During unexpected scaling events like a DDoS, putting maximum limits on concurrency reduces the amount of traffic that back-end servers need to consider, preventing things from getting out of hand.
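A service level expressed as maximum latency plus an acceptable error rate is easy to check mechanically. The sketch below assumes a hypothetical list of observed `(latency_seconds, http_status)` pairs and treats any 5xx status as an error response; the thresholds are example values only.

```python
def meets_service_level(responses, max_latency_seconds: float, max_error_rate: float) -> bool:
    """Check observed (latency_seconds, http_status) pairs against a service level."""
    if not responses:
        return True  # nothing observed, nothing violated
    errors = sum(1 for _, status in responses if status >= 500)
    slowest = max(latency for latency, _ in responses)
    return slowest <= max_latency_seconds and errors / len(responses) <= max_error_rate


observed = [(0.3, 200), (0.8, 200), (0.05, 503), (1.2, 200)]
print(meets_service_level(observed, max_latency_seconds=2.0, max_error_rate=0.25))  # True
print(meets_service_level(observed, max_latency_seconds=1.0, max_error_rate=0.25))  # False
```

In this framing the fast 503 counts against the error budget but not against the latency limit, which is the trade the section argues for.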
It is far easier to plan for acceptable user experience if the requests don’t time out at the browser, but instead terminate in infrastructure. Lengthy requests don’t tie up external interfaces if you use asynchronous UI as alluded to previously. In mobile, thanks to the changing nature of network requests, this leads to a much better user experience, independent of the higher availability.
Paradoxically, you can improve your user experience by reducing concurrency and managing client-visible latency. Our own API Gateway can certainly help, but setting time-aware, concrete, user-friendly error conditions as part of the API definition is central to the strategy to get there.