Like so many other areas of modern life, the world of IT operations both depends on and is powered by data. There are myriad sources of data that can be harnessed to provide visibility into and control of IT infrastructure but, at a high level, there are three broad classes. There's log data, as popularised by Splunk; there's API data, such as that exploited by New Relic; and there's the data that Corvil specialises in: packet data.
At Corvil, we believe that packet data is uniquely rich. Not only do network packets carry the transactions that constitute the lifeblood of the business, but each packet also provides a snapshot of the state of the network, the operating systems, and the middleware that supports those transactions. The full richness of packet data is a fascinating topic that I love talking and writing about, but one I'll have to leave for another time. For now, I want to address a slightly different topic - some of the challenges of dealing with packet data, and an important lesson I learned about a decade ago.
The most basic challenge of dealing with packet data is one of raw scale. Of all the sources of data, the network is the biggest, the fastest, and the most varied. Log data is, by design, quite targeted with only key events captured and simply formatted; API data is similarly very selective to minimize application overhead. In contrast, the data flowing through the network is the full unadulterated firehose of the activity of all layers of the technology stack.
However, the real challenge with packet data lies in making sense of it, structuring it, and extracting the value inherent in it. Our customers often tap into the network, not to monitor the network itself, but because they need a non-intrusive view into the response-times of their applications. They use our system to piece back together the application messages that are encoded and segmented into packets, and reveal a precise record of the behaviour of their application.
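To give a flavour of what that reconstruction involves, here is a toy Python sketch. It is entirely illustrative, not Corvil's implementation: it reorders TCP-style segments by byte offset, checks for gaps, and frames the resulting stream into newline-delimited messages.

```python
# Toy sketch of stream reassembly: reorder captured segments by their byte
# offset in the stream, then frame the stream into application messages.
# Names and the newline framing are illustrative assumptions.

def reassemble(segments):
    """Order (seq, payload) segments by sequence number and concatenate.

    seq is the byte offset of the payload within the stream.
    """
    stream = bytearray()
    expected = min(seq for seq, _ in segments) if segments else 0
    for seq, payload in sorted(segments):
        if seq < expected:            # retransmitted overlap: keep new bytes only
            payload = payload[expected - seq:]
            seq = expected
        if seq > expected:            # gap in the stream: cannot reassemble yet
            raise ValueError(f"missing bytes at offset {expected}")
        stream.extend(payload)
        expected = seq + len(payload)
    return bytes(stream)

def split_messages(stream, delimiter=b"\n"):
    """Frame the byte stream into application messages (newline-delimited here)."""
    return [m for m in stream.split(delimiter) if m]

# Segments captured out of order, as they often arrive on a real network:
segments = [(6, b"world\n"), (0, b"hello\n")]
messages = split_messages(reassemble(segments))
```

Real traffic adds retransmissions, interleaved flows, and binary protocol framing, which is where most of the engineering effort actually goes.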
We have seen how the combination of these twin challenges can be daunting, and how it drives vendors to focus on only one or the other of them. Some defer the challenge of analysing the packet data, and build plain packet-capture boxes that simply vacuum packets off the network onto disk, leaving the user to go dumpster-diving in the raw packets. Others don't try to capture the firehose of packet data at all, but immediately whittle it down to the application-level data of interest. Given the difficulty of both capturing at scale and analysing in real time, the choice to specialize is understandable.
What is less understandable is the argument that packet capture is not necessary; that all that is required is the ability to stream selected analytics; that packet capture has no place in IT Operations Analytics; or that so-called "precision packet capture" is sufficient. When all you have is a hammer, you see nails everywhere.
The reality of today's IT world is that you cannot ignore the context in which the processes of interest operate. Yes, it is important to start with key performance metrics such as client experience, and those metrics are what the business stakeholders will want on a dashboard. However, the business depends on the health of the infrastructure as a whole. It depends on the ability of the IT Operations team to drill down into the full technology stack. Most importantly, it depends on the ability to capture and analyze even the data that you don't expect to be relevant.
This is a crucial point: it's one on which the whole approach to security has hinged in recent years. We have all learned that advanced persistent threats mean it's no longer sufficient to rely on firewalls to keep threats out. Threats will get into your environment, and the best remedy is not to try to build stronger walls but to ensure that you have full visibility of all activity in your environment, even where, and especially where, you don't expect threats.
I learned this lesson over a decade ago. At the time, we had decided to build appliances with no moving parts for robustness. This meant using flash storage, which in turn meant we had to be extremely parsimonious when storing packet captures. My team and I designed an elegant system of selectively triggered packet capture, so that we could fit within our self-imposed limits. While this gave appropriate context to the events and anomalies that we could detect, we quickly realised that there would always be events that we couldn't detect, or that would only become obvious in retrospect. The only way to provide our users with the complete context and visibility they needed was to provision our systems with sufficient disk capacity for continuous high-speed packet capture.
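The triggered approach we started with can be sketched, very loosely, as a small ring buffer of recent packets that is flushed out only when a detector fires. Everything below (the detector, the buffer size, the names) is a hypothetical toy, but it shows the inherent limitation: anything the detector misses is silently lost once it rotates out of the buffer.

```python
# Toy sketch (not Corvil's code) of trigger-based capture: an in-memory
# ring buffer holds only recent packets, and a snapshot is persisted only
# when an anomaly detector fires on the current packet.
from collections import deque

class TriggeredCapture:
    def __init__(self, buffer_size, detector):
        self.ring = deque(maxlen=buffer_size)  # recent packets only
        self.detector = detector               # predicate: packet -> bool
        self.snapshots = []                    # stands in for "written to disk"

    def on_packet(self, packet):
        self.ring.append(packet)
        if self.detector(packet):
            # Persist the context leading up to the triggering packet.
            self.snapshots.append(list(self.ring))

# Trigger on oversized packets -- a hypothetical anomaly criterion.
cap = TriggeredCapture(buffer_size=3, detector=lambda p: len(p) > 5)
for pkt in [b"a", b"bb", b"ccc", b"anomaly!", b"dd"]:
    cap.on_packet(pkt)
# One snapshot is taken, holding only the 3 packets around the trigger;
# the earlier packet b"a" has already been discarded from the buffer.
```

Any event the detector doesn't anticipate leaves no trace at all, which is precisely why we moved to continuous capture.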
This lesson was reinforced several times over the years, such as during the beta-testing of our first message-based latency UI. Our tester immediately noticed that he had no way to drill down into a latency measurement and discover which two events had been matched to create it. Luckily, the capability had been built but simply hadn't made it into that beta build. However, our user stressed the importance of being able to drill into the raw data underlying every analytic so strongly that we nicknamed the GUI feature after him.
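To make concrete what "the two events matched to create the measurement" means, here is a hypothetical sketch, with invented field names, of pairing request and response events by an id and keeping a reference to the underlying evidence alongside each latency number:

```python
# Hypothetical sketch of latency measurement with drill-down: each
# measurement records not just the number but the two raw events it was
# derived from, so the evidence is always reachable from the analytic.

def match_latencies(requests, responses):
    """Pair request/response events by id; keep the matched events."""
    by_id = {req["id"]: req for req in requests}
    measurements = []
    for resp in responses:
        req = by_id.get(resp["id"])
        if req is not None:
            measurements.append({
                "latency": resp["ts"] - req["ts"],
                "matched": (req, resp),   # the drill-down evidence
            })
    return measurements

reqs = [{"id": "A1", "ts": 10.000}, {"id": "A2", "ts": 10.002}]
resps = [{"id": "A1", "ts": 10.005}]
ms = match_latencies(reqs, resps)
```

Without the `matched` references (or the packets behind them), the latency figure would be a number the user simply has to trust.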
The Corvil architecture
The result is that the Corvil platform today embodies these lessons. We place the ability to capture raw data, both packet data and decoded application messages, at the heart of the system. It's worth noting that the design is not what some of our competitors have mis-characterized as a slow post-storage design, in which packets are first stored to disk only to be read back later for analysis. Naturally, we process and analyze packets in memory; that's how we achieve real-time dashboards, alerting, and analytics streaming.
We capture packets, stream-process them in memory in real time, extract application data, and enrich the raw data with meta-data and analytics. We do all of this while optionally capturing each stage of processing to disk for retrospective use-cases and later analysis. Capture is optional, and all of the analysis, alerting, and streaming is done independently of it, but we find that our users nearly always elect to enable it because of the enormous value it adds:
They have the ability to drill into the detail underlying any incident, retrieve a full forensic trail of evidence supporting the analysis, and get the full context of what was happening in the infrastructure.
The evidence required to fully understand user experience sometimes lies in the packets, not the applications; without the packet data, application behaviour can sometimes remain utterly mysterious.
Finally, the integration of streaming analytics with continuous packet capture offers an advantage in the workflow it enables. With separate analytics and packet-capture systems, even from the same vendor, you typically face an awkward workflow, hopping from a highly digested summary view on one system to unstructured packet captures in a different GUI on another box. In contrast, our users can drill down from high-level real-time dashboards into the detail supporting them, and ultimately into the application messages and packets, all in a single pane of glass: one coherent user interface with one globally-synchronized time reference for every piece of data and every analysis derived from it.
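The pattern described above, analysis in memory with capture as an independent, optional side-channel, can be sketched in a few lines. All names here are hypothetical stand-ins, and a list stands in for continuous capture to disk:

```python
# Illustrative sketch of streaming analysis with optional raw capture:
# every packet updates in-memory analytics as it arrives, and a timestamped
# copy of the raw data is optionally retained for later drill-down.

class StreamingAnalyzer:
    def __init__(self, capture_enabled=True):
        self.capture_enabled = capture_enabled
        self.captured = []        # stands in for continuous capture to disk
        self.byte_count = 0       # a trivial real-time analytic

    def on_packet(self, packet, timestamp):
        # Analysis happens in memory, independent of capture.
        self.byte_count += len(packet)
        if self.capture_enabled:
            # Retain the raw, timestamped packet for retrospective analysis.
            self.captured.append((timestamp, packet))

analyzer = StreamingAnalyzer(capture_enabled=True)
analyzer.on_packet(b"GET / HTTP/1.1\r\n", timestamp=0.001)
analyzer.on_packet(b"HTTP/1.1 200 OK\r\n", timestamp=0.004)
```

The key design point is that disabling capture changes nothing about the analytics path, while enabling it preserves the raw evidence behind every number.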
It is fascinating to see different architectures being developed to tackle the different challenges posed by tapping into the network for packet data, and it is interesting to understand why different vendors choose to focus on just some of those challenges. But no matter what you do, don't make the mistake of believing that there is a hard choice to be made between streaming analytics and continuous packet capture. Sometimes, you can eat your data cake in real-time and have a copy of it for later too!