How to Index Petabytes of Data Daily

By Matt Davey | November 12, 2015 | Product

We were recently asked to estimate how much data Corvil analyzes daily, across all our customers, and it came out at about 8 petabytes. A single Corvil appliance, flat out, can happily analyze several terabytes per hour. Contrast that with guidance from Splunk, which estimates a single server can scale to analyze 250 GB of data per day.

This isn't in any way meant as a knock on Splunk (or Elasticsearch, MongoDB, or other general indexing solutions); making data searchable in the way they do isn't cheap, and it's an apples-to-oranges comparison. My point is simply that, as things stand today, it's infeasible to send all of your network data to these tools: several terabytes per hour works out to tens of terabytes per day, against guidance of a quarter of a terabyte per day, so the throughput is a couple of orders of magnitude too low.
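The arithmetic is easy to check. A minimal back-of-envelope sketch, assuming 3 TB/hour as a midpoint reading of "several terabytes per hour":

```python
import math

# Back-of-envelope comparison using the figures quoted above.
# 3 TB/hour is an assumed midpoint for "several terabytes per hour".
corvil_tb_per_day = 3 * 24        # one appliance: ~72 TB/day
splunk_tb_per_day = 250 / 1000    # 250 GB/day guidance = 0.25 TB/day

ratio = corvil_tb_per_day / splunk_tb_per_day
print(f"~{ratio:.0f}x gap ({math.floor(math.log10(ratio))} orders of magnitude)")
# -> ~288x gap (2 orders of magnitude)
```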

This scaling problem is a challenge faced by a growing number of operations teams. They recognize the comprehensive, valuable visibility locked up in the data flowing across their networks. They want to analyze packet data because it is a real-time record of every transaction and every interaction, right as it happens. Network data can tell you how each app is performing, who is using it, and how each tier and supporting service is interacting, all without installing a single agent. The problem is efficiently accessing that data and making it available to the operations team.

Solutions like Streams for Splunk and Packetbeat for Elastic offer a partial answer. They make it relatively easy to decode streams of packets and send the resulting events to the indexer, but they don't address the scaling problem. You are forced either to deploy huge indexing resources (and, perhaps, data-volume licenses) or to throw away 95% of your network data, sending only a sample or summarized metrics. Either way, these tools don't hold onto any raw data, so the original packet data is gone.
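To make that trade-off concrete, here is a minimal sketch of the two fallback options. The `indexer.send(event)` client and the event fields are illustrative assumptions, not the actual APIs of Streams or Packetbeat:

```python
import random
from collections import defaultdict

def forward_sample(events, indexer, rate=0.05):
    """Option 1: ship a 5% sample; the other 95% never reach the indexer."""
    for ev in events:
        if random.random() < rate:
            indexer.send(ev)

def forward_summary(events, indexer):
    """Option 2: ship per-service rollups; individual transactions are lost."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ev in events:
        t = totals[ev["service"]]
        t["count"] += 1
        t["bytes"] += ev["bytes"]
    for service, stats in totals.items():
        indexer.send({"service": service, **stats})
```

Either function keeps the indexing volume manageable, but in both cases the packets behind the events are discarded.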

This kind of high-level summary view means you miss out on several of the key benefits of analyzing network data in the first place: complete visibility, forensics for root cause analysis, a sequenced view of how events unfolded, and access to packet-captures for the network team.

Our customers get the best of both worlds. With Corvil, they plug into the full firehose of network data and get visibility into the content and performance of apps and services, with the option to drill down to individual transactions and efficiently search for unique IDs and more. In addition, they can easily send high-value streams of events in real-time to their indexer of choice using our library of connectors. For example, customers can send call-records for every Voice over IP call, created from the packets on the network, or a record of every database transaction exceeding a volume or response-time limit. Plus, they can follow links from their indexer back into the detailed Corvil data for deep-dive analysis.
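As an illustration of that pattern, here is a minimal sketch of forwarding only high-value events, in this case slow database transactions, to Splunk's HTTP Event Collector. The URL, token, threshold, and transaction fields are assumptions for illustration, not Corvil's connector library:

```python
import json
import urllib.request

SPLUNK_HEC = "https://splunk.example.com:8088/services/collector/event"  # assumed host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                       # placeholder

def forward_slow_transactions(transactions, limit_ms=500):
    """Send only transactions over the response-time limit; the full
    packet record stays on the appliance for later drill-down."""
    for txn in transactions:  # e.g. {"id": "abc123", "latency_ms": 812, ...}
        if txn["latency_ms"] > limit_ms:
            payload = json.dumps({"event": txn}).encode("utf-8")
            request = urllib.request.Request(
                SPLUNK_HEC,
                data=payload,
                headers={"Authorization": f"Splunk {HEC_TOKEN}",
                         "Content-Type": "application/json"},
            )
            urllib.request.urlopen(request)
```

Because only the exceptions cross the wire, the indexer sees a small, high-signal stream rather than the raw firehose.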

If you are already using Corvil data to power IT Operations, or just want to learn more about how we can help, we’d love to hear from you.

Matt Davey, Director, Product Management, Corvil
Corvil is the leader in performance monitoring and analytics for electronic financial markets. The world’s financial markets companies turn to Corvil analytics for the unique visibility and intelligence we provide to assure the speed, transparency, and compliance of their businesses globally. Corvil watches over and assures the outcome of electronic transactions with a value in excess of $1 trillion, every day.
@corvilinc