When short-lived events have long-lived consequences

IT infrastructure often operates at very different timescales from those of the business it supports, and microsecond events can seem quite divorced from business outcomes. In fact, achieving predictable business performance can depend critically on understanding and managing short-lived network and application events.

By Raymond Russell | May 29, 2014 | Thinking

At Corvil, we are very used to dealing with short-lived events, microsecond latencies, and nanosecond timestamps, so much so that we are inured to the huge disparity between these timescales and those of human experience. We are also very used to interacting with people who deal in the same timescales as us, and when we venture beyond the narrow circle that is ultra-low-latency technology, we often encounter some unexpected responses.

When we explain the capabilities of Corvil and the scales it operates at, the reactions vary from admiration ("That sounds like rocket science"), through puzzlement ("How can you actually achieve that?"), to dismissal ("You don't need anything that complex to solve my problems"). Often the last response is simply the most candid expression of a common concern underlying all of them, namely a concern that such detailed information about short-timescale events is both irrelevant and distracting. Irrelevant because the details at microsecond timescales are detached from the business, which operates at human timescales; distracting because it is impossible to digest information about microscopic events at the rate at which they occur, or likewise to make decisions based on that information.

Are short-lived events relevant?

The only comprehensive answer to this question is also somewhat evasive: it depends. Clearly many classes of short-lived events are irrelevant and, even when they pertain directly to the business, their exact timing is often unimportant. For example, in an e-commerce context, the exact time at which a shopper clicks the web-page button to complete their purchase usually matters little to the retailer. Far more relevant are the aggregate rate at which all shoppers place orders (profitability) or the rate of new customer acquisition (growth).

Timing

By the same token, the exact time of purchase usually doesn't matter to the shopper either. However, the situation can be very different in an auction, due to the use of time priority to break the tie between otherwise identical bids. The most extreme example of this is in modern electronic trading systems, where exchanges effectively run a continuous auction and some traders compete fiercely for time priority: the reaction times of modern liquidity-taking strategies are in the single-digit microseconds, and as a result the timing of orders down to this scale is critical to winning or losing each trade.

Similar examples where the precise timing of events matters include electronic payment systems and real-time bidding (RTB) on web-advertising impressions. Another class of applications where the timing of events is important is real-time streaming media, such as voice-over-IP (VoIP), video conferencing, or streaming video. The nature of these applications is that voice or video samples that don't arrive on time must be discarded, compromising quality and end-user experience.

Congestion

So events that happen at microscopic timescales in IT infrastructure can have important consequences at human timescales, and this is one direct driver of the need for fine-timescale instrumentation. However, the need for monitoring is not limited to providing a record of the events as they occurred. It is also driven by the need to understand the factors that cause them: the factors that drop voice and video samples, that delay payments and real-time bids, and that prevent traders from taking advantage of short-lived but valuable pricing opportunities.

There are many ways in which microscopic performance events can occur, and their symptoms can be quite diverse, but they are nearly always driven by the same basic phenomenology: the demand placed on a processing component temporarily exceeds its capacity, causing a processing backlog; the resulting delay, and possibly data loss, manifests itself as performance degradation.

Microburst Detection

This observation contains the seeds of an answer to the question of how to detect the causes of congestion: we need to monitor for excesses of demand over capacity at the timescales of the events we need to protect against. For the microscopic events that, as we have seen, are critical in all kinds of real-time applications (whether electronic trading and bidding, or interactive and media applications), this means measuring and detecting microbursts.

Microburst measurements are quite simple to describe: you define a timescale of interest, say 1ms, and measure the maximum demand (perhaps bandwidth in a network, or message rate in an application) seen in any 1ms sub-interval. An immediate conclusion you can draw is that, if the available capacity is at least as high as the maximum microburst seen, then you will never experience any delay or processing backlog longer than 1ms. Correspondingly, if the amount of buffering is sufficient to accommodate 1ms-worth of data, then you will never experience any data loss due to buffer overflow.
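
As a concrete illustration, here is a minimal Python sketch of a microburst measurement over synthetic packet records. The data, the 1ms window, and the function name are assumptions chosen for the example; this is not a description of Corvil's implementation.

```python
# Minimal sketch: given (timestamp_seconds, bytes) packet records, find the
# maximum number of bytes seen in any window of the chosen timescale (1 ms).

from collections import deque

def max_microburst(packets, window=0.001):
    """Return the peak bytes observed in any `window`-second interval."""
    packets = sorted(packets)          # order by timestamp
    q = deque()                        # packets inside the current window
    in_window = 0                      # running byte count for the window
    peak = 0
    for ts, size in packets:
        q.append((ts, size))
        in_window += size
        # Evict packets that have fallen out of the window ending at `ts`.
        while q and q[0][0] <= ts - window:
            _, old_size = q.popleft()
            in_window -= old_size
        peak = max(peak, in_window)
    return peak

# Example: three packets bunched within 1 ms dominate the microburst,
# even though the average rate over the whole second is tiny.
packets = [(0.1000, 1500), (0.1004, 1500), (0.1008, 1500), (0.9, 1500)]
print(max_microburst(packets))         # 4500 bytes in the busiest 1 ms
```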

These characteristics make microburst measurements easy to understand and, in some respects, easy to reason about. However, they do not aggregate in obvious ways at all: suppose we have two different sources of demand, such as two different networks being routed out over a common access link. Much longer-timescale measurements of bandwidth are effectively averages and simply add up linearly: the average bandwidth from both networks is the sum of the average from each network individually. In contrast, microburst measurements pick out the microscopic peaks of demand and, the shorter the microburst timescale, the less likely it is that the peaks from each subnetwork will overlap. It is this non-linear property that also makes microburst measurements difficult to calculate, requiring significant processing power.
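
The following sketch, which reuses the max_microburst function from the example above on two synthetic packet streams, illustrates this non-linearity: the averages from the two networks add exactly, while the microburst of the combined traffic is at most, and typically well below, the sum of the individual microbursts. The uniform traffic model is an assumption made purely for illustration.

```python
# Two synthetic 1-second packet streams sharing a common access link.
import random

random.seed(1)

def synth_stream(n_packets, duration=1.0, size=1500):
    """Packets with uniformly random timestamps over `duration` seconds."""
    return [(random.uniform(0, duration), size) for _ in range(n_packets)]

net_a = synth_stream(2000)
net_b = synth_stream(2000)

avg_a = sum(s for _, s in net_a)         # bytes/second (duration is 1 s)
avg_b = sum(s for _, s in net_b)

burst_a = max_microburst(net_a)          # from the earlier sketch
burst_b = max_microburst(net_b)
burst_ab = max_microburst(net_a + net_b)

print("averages add exactly: ", avg_a + avg_b)
print("sum of microbursts:   ", burst_a + burst_b)
print("microburst of the sum:", burst_ab)   # typically well below the sum
```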

Queueing

Congestion is ultimately about queueing, and there are even greater surprises lurking when we consider the balance of forces that determines queueing. When the typical demand remains below the available capacity to service that demand, the queue remains mostly empty. This regime can seem quite stable over a wide range of demand patterns: so long as there is sufficient capacity, the system remains stable and performs well. This stability can, however, be deceiving: the capacity of the system to meet demand (network bandwidth or message-processing power) represents a sea-wall whose breach leads to catastrophic consequences. Once demand exceeds capacity, queues grow without bound, leading to a collapse in performance and massive data loss.

The really intriguing aspect of queueing is that this response can play out over multiple different timescales. Short-lived microbursts in demand can drive periods of congestion which are a microcosm of the catastrophe we described above. The flood may not last a long time but, as we have seen, short-lived events can have serious business impact. The higher and more sustained the demand microbursts are, the longer and more severe the periods of congestion in the queue will be.
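
To make this cliff-like behaviour concrete, here is a toy backlog simulation under assumed per-millisecond demand figures; it is a sketch of the general queueing argument, not a model of any particular system. The average demand stays comfortably below capacity, yet a 20ms microburst at twice the capacity drives the backlog to a large peak before it drains away.

```python
# Toy queue: demand per 1 ms slot is fed to a server with fixed capacity per
# slot, and the backlog follows backlog = max(0, backlog + demand - capacity).

def simulate_backlog(demand_per_slot, capacity_per_slot):
    """Track the queue backlog slot by slot."""
    backlog = 0
    history = []
    for demand in demand_per_slot:
        backlog = max(0, backlog + demand - capacity_per_slot)
        history.append(backlog)
    return history

# 100 slots of 1 ms each: modest demand, with a 20 ms burst at twice capacity.
demand = [500] * 40 + [2000] * 20 + [500] * 40   # bytes per slot
capacity = 1000                                  # bytes per slot

history = simulate_backlog(demand, capacity)
print("average demand:", sum(demand) / len(demand))   # 800, safely below capacity
print("peak backlog:  ", max(history))                # 20000 bytes during the burst
print("final backlog: ", history[-1])                 # drains back to 0 afterwards
```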

It turns out that the relationship between capacity, demand, and the extent of the congestion can only be unraveled with measurements made at the timescale of the congestion itself. The resulting imperative is that, if you want to diagnose and protect against short-lived congestion events, you must measure at microscopic timescales.

Summary

We have explored how microscopic events can and do have long-lived business consequences. We have also seen that diagnosing and troubleshooting these events requires an accurate time-record of how each event evolved. Furthermore, understanding the causes of these events requires analytics at the timescales of the events themselves, analytics that cannot be deduced from longer-timescale measurements: microburst measurements don't aggregate in a linear fashion, while the congestion response to changes in the balance between demand and capacity is highly non-linear, exhibiting a sharp cliff-like behaviour.

It is for this reason that managing the performance of your IT infrastructure to meet your business demands requires the micro-visibility that lies at the heart of Corvil.

Raymond Russell, Founder & CTO, Corvil
Corvil is the leader in performance monitoring and analytics for electronic financial markets. The world’s financial markets companies turn to Corvil analytics for the unique visibility and intelligence we provide to assure the speed, transparency, and compliance of their businesses globally. Corvil watches over and assures the outcome of electronic transactions with a value in excess of $1 trillion, every day.
@corvilinc
