Three Spikes and You're Out


By Raymond Russell, May 15, 2014

Corvil was the first system designed to monitor the performance of both applications and the networks they run on in real time. You might wonder if this dual ability is simply a matter of packaging convenience: it's imperative to monitor the operation of your applications, and it's also valuable in its own right to monitor the performance of your network. Furthermore, these two tasks have traditionally been handled by separate IT teams, who use different tools to achieve them. So what is the value in creating a single integrated platform for both tasks? The answer is that the performance of the two domains, application and network, is inextricably intertwined, and the performance of distributed applications can neither be understood nor managed without visibility into the network layer.

In order to perform, a distributed application obviously depends on the network to transport its data quickly and reliably. However, the ubiquitous IP protocol (together with TCP, the transport layer used for the vast majority of application purposes) completely abstracts away the details of the network from the application. It doesn't matter what the underlying network consists of - it may be copper, fiber, wireless, or any mixture of dozens of network technologies built on top of these media - the IP protocol provides a uniform interface. For an application, this abstraction makes sending messages across the network very simple: it creates a connection using the address of the destination, and then just writes its messages into the connection, after which they magically appear out the other end on the far side of the network. For us, this process is much like making a telephone call: all we need do is dial a number and speak, without worrying about how the connection is established or how our voices are digitally encoded and decoded by the telephone network. This simplicity is powerful, allowing application developers to focus on building useful software without having to worry about how the data gets delivered across the network, so long as everything works well.
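To make the "connect by address, then just write" idea concrete, here is a minimal sketch in Python using the standard `socket` module. The echo server, the function names and the loopback address are all assumptions made for illustration; nothing here is specific to Corvil. Notice that the client code contains no reference to the physical network at all.

```python
import socket
import threading


def run_echo_server(listener: socket.socket) -> None:
    """Accept a single connection and echo back whatever arrives on it."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the client has finished sending
                break
            conn.sendall(data)


def send_and_receive(host: str, port: int, message: bytes) -> bytes:
    """Connect using only the destination address, write, then read the reply."""
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(message)
        sock.shutdown(socket.SHUT_WR)  # tell the server the message is complete
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server has closed its side
                break
            chunks.append(chunk)
    return b"".join(chunks)


def demo() -> bytes:
    """Run a hypothetical echo service on loopback and send it one message."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: the OS picks a free port
    port = listener.getsockname()[1]
    server = threading.Thread(target=run_echo_server, args=(listener,), daemon=True)
    server.start()
    try:
        return send_and_receive("127.0.0.1", port, b"hello over TCP")
    finally:
        server.join(timeout=5.0)
        listener.close()


if __name__ == "__main__":
    print(demo())
```

The same client code would run unchanged whether the two ends were on one host, as here, or separated by any mix of copper, fiber and wireless links.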

However, the real challenge starts when things don't work well. Any number of things may go awry out on the network: congestion may cause packets to be buffered and hence delayed, those buffers may overflow, causing packet drops, or radio interference on a wireless link may cause packet corruption. Precisely because of the simplicity of the IP protocol, applications can never learn about these events; the only thing they can see is that the messages they send show up late, or never show up at all. The network truly is a black box - when it works well, it's like magic; when it doesn't, it fails silently, giving no clues as to why.

Given this, it's no wonder that application developers can harbour a certain degree of mistrust of the network. When applications suffer performance problems despite no obvious bugs in the code and no evidence of system load on the hosts, it's easy to blame the mysterious slowdowns on the network. However, life is no easier for network operators: while they can ensure that network links are up, that routes are stable, and that the network has plenty of bandwidth on average, they often lack visibility into the dynamic state of the network.

Furthermore, the network is not always to blame for poor application performance; even when there is short-lived congestion and packet-loss, applications are often surprisingly resilient to it. Just as often, the performance issues are due to congestion in support systems such as databases and file-servers, or to a software design that doesn't scale with production workloads. As a result, the developers' mistrust of the network is often reciprocated: once network operation teams see that links are up and running, and there are no obvious packet-loss issues or network storms, it's easy to conclude that any performance problems must actually lie back inside the application.

One of the principal design goals of Corvil is to bridge this divide between the application and the network. Corvil appliances capture application messages on the wire in the network, and use network timestamps to establish a precise and reliable record of the timing of application events. From this record, the Corvil platform can build up a comprehensive picture of the behaviour of the applications running on the network, right up to an integrated view of the business process that the application implements. Corvil can also drill down into the details of the network packets that carry the application messages, both tracking the behaviour of the connections that the network creates and analysing the impact of the traffic generated by applications on the network.

user response time depends on both application performance and network performance

Figure 1.

Let's take a concrete example to illustrate this idea: Figure 1 shows a client on the left that connects over the network in the middle to a service provided by an application on the right. A key performance metric is the response-time that the client experiences in using the service. This might be because the client has a time-critical task to complete and hence a critical dependency directly on the response-time, or it might be that it has a large number of transactions to complete and the rate at which it can complete them is gated by the time taken by the service to fulfil each one. In any case, to capture the performance of the system, we'll look at the response-time profile over time.

graph showing user response-time measured over time with three spikes indicating slow response times

Figure 2.

Figure 2 shows what the client might observe: most of the time, the response-time is acceptable, but there are occasional glitches where it rises far higher. Note that this figure is for illustrative purposes only, and in practice we see spikes that are orders of magnitude larger than typical response-times.
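As a rough illustration of how such glitches might be picked out of a response-time profile, the sketch below flags any sample that exceeds a multiple of the median. The sample values, the function name `find_spikes` and the factor of 5 are all made up for this example; they are not taken from the figure or from any Corvil product.

```python
from statistics import median


def find_spikes(response_times_ms, factor=5.0):
    """Return the indices of samples that exceed factor * median.

    The median serves as the "typical" response-time because, unlike the
    mean, it is barely moved by a handful of very large spikes.
    """
    threshold = factor * median(response_times_ms)
    return [i for i, rt in enumerate(response_times_ms) if rt > threshold]


# Hypothetical response-times in milliseconds, with three spikes.
samples = [2.1, 2.0, 2.3, 48.0, 2.2, 2.1, 310.0, 2.0, 2.4, 95.0, 2.2]
print(find_spikes(samples))  # prints [3, 6, 9]
```

Detection of this kind only tells you that a spike happened and when; it says nothing yet about the cause.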

Let's explore what's happening in each of these three spikes. As we saw earlier, the application has no visibility into the behaviour of the response-times: it sends a query, and some time later the reply comes back with no further information on why it took as long as it did. In order to understand what's going on, we'll need to look at the network layer too, as shown in Figure 3.

network diagram showing the cause of the first spike in response time

Figure 3.

1. In the first case, congestion on the network leads to the packets carrying the query being delayed in a router. We see clearly that the spike in application response-time is matched by the spike in network latency, and actually the non-network component is relatively unchanged over this event.

network diagram showing the cause of the second spike in response time

Figure 4.

2. In the second case, a microburst in the return path fills a router buffer and causes the packet carrying a reply to be dropped. Because the packet never arrives, the TCP stack in the client cannot acknowledge it; the server waits for several round-trip times before concluding that the packet might have been dropped, and then retransmits it.

network diagram showing the cause of the third spike in response time

Figure 5.

3. In the third case, the network is operating completely normally and introduces neither additional latency nor packet-loss, and the slow response-time is entirely due to an event internal to the application on the server-side.
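The three cases can be told apart once each transaction is split into a network component and a server component. The sketch below assumes, purely for illustration, that four timestamps are available per transaction from capture points near the client and near the server with synchronised clocks, plus a flag saying whether a TCP retransmission was seen. The `Transaction` type, the `classify` function and the threshold factor are hypothetical and do not describe Corvil's implementation.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Wire timestamps for one query/reply pair, in milliseconds (assumed inputs)."""
    query_sent: float      # query seen leaving the client
    query_arrived: float   # query seen arriving at the server
    reply_sent: float      # reply seen leaving the server
    reply_arrived: float   # reply seen arriving at the client
    retransmitted: bool = False  # a TCP retransmission was observed

    @property
    def network_time(self) -> float:
        """Time spent crossing the network, outbound plus return."""
        return (self.query_arrived - self.query_sent) + (self.reply_arrived - self.reply_sent)

    @property
    def server_time(self) -> float:
        """Time between the query arriving and the reply leaving the server."""
        return self.reply_sent - self.query_arrived


def classify(txn, normal_network_ms, normal_server_ms, factor=5.0):
    """Attribute a slow transaction to packet loss, network latency or the application."""
    if txn.retransmitted:
        return "packet loss"
    if txn.network_time > factor * normal_network_ms:
        return "network latency"
    if txn.server_time > factor * normal_server_ms:
        return "application"
    return "normal"


# Hypothetical transactions, one per spike described above.
spike_1 = Transaction(0.0, 40.5, 41.5, 42.0)                        # query delayed in a router
spike_2 = Transaction(0.0, 0.5, 1.5, 202.0, retransmitted=True)     # reply dropped, then resent
spike_3 = Transaction(0.0, 0.5, 60.5, 61.0)                         # slow server-side event

for txn in (spike_1, spike_2, spike_3):
    print(classify(txn, normal_network_ms=1.0, normal_server_ms=1.0))
```

Checking for retransmissions first is a design choice in this sketch: a dropped packet also inflates the measured network time, so testing latency first would hide the loss.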

graph of network performance metrics with two events highlighted, demonstrating that without application metrics you have no way of assessing the impact of network events on application performance

Figure 6.

If you only have the network view (shown in green and red), then you do see the events that happen on the network; however, you have no way of assessing what impact those events have on the application. It may be that those events hit a non-critical query that the client is running in parallel or, if the network instrumentation isn't flow-based, you may not even know which application is affected. You don't know what action is required, if any, because you can't assess whether you even have a performance problem in the first place. Conversely, with only the application view (shown in blue), you can see the impact of all events on application response time, but you have no way of distinguishing between network effects and pure application events. You know action is required, but you don't know where to start looking, nor do you have any hint as to what action to take.

combined graph of application performance and network performance metrics provides context needed to determine root cause of user response times

Figure 7.

This quandary illustrates the power of having integrated network and application monitoring: you can immediately see the impact on the response times and, in one screen in the same context, see what actions are required. For example, these actions might be:

  1. Upgrade the client's access bandwidth.
  2. Hold the service provider (SP) accountable to the contracted SLAs.
  3. Profile the performance of the server application.

For more information on how Corvil can bring a new perspective to your network and application performance challenges, please visit our website at corvil.com.


Raymond Russell, Founder & CTO, Corvil
Corvil is the leader in performance monitoring and analytics for electronic financial markets. The world’s financial markets companies turn to Corvil analytics for the unique visibility and intelligence we provide to assure the speed, transparency, and compliance of their businesses globally. Corvil watches over and assures the outcome of electronic transactions with a value in excess of $1 trillion, every day.