Learn how Corvil network data analysis powers data visualization and Linked Data Analysis capabilities from Sqrrl, enabling users to run queries against electronic trading activity data.
Activity in a large electronic trading firm usually encompasses multiple trading strategies and systems, buying and selling different securities on different external markets. A typical trading day will see many orders placed by traders and/or clients, generating multiple trades. The activity is spread out across an infrastructure of many processing pathways, and the complexity of the environment can make it challenging to visualize performance and identify bottlenecks and suboptimal behaviors. Poor performance and risky activity in trading usually appear as a pattern of behavior that takes place over a period of time, rather than as a single, readily identifiable event. It’s always possible that individual orders might occasionally fail to execute as intended, for reasons that are beyond a firm’s immediate control - for example temporary surges in market activity, or fortuitous actions by competitors. It’s only when a recurrent pattern of poor or unexpected outcomes begins to appear that action can be taken to find and correct its root cause.
Key to pattern identification is having the ability to collect, analyze and visualize data to reveal relationships among the important factors that can affect trading outcomes. One fascinating approach that is emerging for visualizing relationships in large data sets is Linked Data Analysis, which is being pioneered by Sqrrl, a Cambridge, MA firm founded by several of the creators of Apache Accumulo. Their platform, Sqrrl Enterprise, uses Accumulo/Hadoop as a graph-oriented data-store. As raw data is ingested into the system, a Linked Data Model is used to automatically transform it into linked data. The data model encapsulates knowledge about relationships between various entities represented in the data - for example, in a trading context these might be traders, securities, orders, trades, markets, accounts, and so on. Graph-oriented queries can then be executed against the linked data, allowing human users to visualize data relationships and patterns, and allowing machine learning algorithms to automatically conduct pattern matching and discovery analysis.
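To make the idea of a linked data model more concrete, here is a minimal Python sketch of how flat records might be turned into a graph of related entities. It is only an illustration of the concept: the field names, relationship labels and sample values are assumptions made up for this example, and it does not represent Sqrrl's actual model format or API.

```python
# Illustrative only: field names and relationship labels are assumptions,
# not Sqrrl's or Corvil's actual schema.
ENTITY_FIELDS = {
    "order_id": "order",
    "trader": "trader",
    "symbol": "security",
    "market": "market",
}
RELATIONS = [
    ("order", "placed_by", "trader"),
    ("order", "for_security", "security"),
    ("order", "sent_to", "market"),
]


def link_records(records):
    """Turn flat records into a set of (source_node, label, target_node) edges."""
    edges = set()
    for rec in records:
        # Each entity found in the record becomes a (type, identifier) node.
        nodes = {
            etype: (etype, rec[field])
            for field, etype in ENTITY_FIELDS.items()
            if field in rec
        }
        for src, label, dst in RELATIONS:
            if src in nodes and dst in nodes:
                edges.add((nodes[src], label, nodes[dst]))
    return edges


def neighbors(edges, node):
    """Return every node one hop away from `node`, in either direction."""
    found = set()
    for src, _label, dst in edges:
        if src == node:
            found.add(dst)
        elif dst == node:
            found.add(src)
    return found


# Made-up sample records.
records = [
    {"order_id": "O1", "trader": "alice", "symbol": "SEC_A", "market": "MKT_1"},
    {"order_id": "O2", "trader": "alice", "symbol": "SEC_B", "market": "MKT_1"},
]
graph = link_records(records)
print(sorted(neighbors(graph, ("trader", "alice"))))
# [('order', 'O1'), ('order', 'O2')]
```

Once the records are linked this way, a question such as "which orders did this trader place?" becomes a one-hop graph lookup rather than a scan over raw messages.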
To get a feel for how Sqrrl Enterprise works, we decided to point it at some order-flow data of the kind you might find in a typical brokerage environment. The first challenge to overcome in an exercise like this is getting relevant trading data from multiple systems across a large infrastructure into Sqrrl. One option is to forward data directly from the trading servers themselves, but this approach has several disadvantages that often make it a non-starter. Existing logging facilities typically don’t capture many important details that are needed to reconstruct trading activity. Turning on more detail puts additional burden on servers that are already responsible for performance-critical tasks, and may require integration work - possibly with several different software systems. Another problem is that server data can be incomplete or inaccurate if the servers themselves are compromised in some way. What if a trading group sets up a new strategy server but omits to integrate it fully with the data collection system? Hopefully that rarely happens - but if it does, it’s precisely the kind of event we would want to detect.
Corvil's Network Data Analytics Platform solves these problems by using the network, rather than the servers, as its data source. Corvil listens in to trading data-flows as they travel across the network that connects trading systems together, and transforms the low-level multi-protocol network data it finds there into high-level, normalized information about trading activity. This approach enables scalable, real-time access to all of the relevant data in the environment without intruding on the trading servers themselves and with zero integration effort. (And whenever a new strategy server is turned on, Corvil will see its trading activity as soon as it hits the network.)
Data extracted by Corvil, along with identifying tags (such as trading session IDs) and derived metrics (such as latencies, message rates and order fill rates), can be easily forwarded to analysis systems like Sqrrl via open Analytics Streams. The platform lets you control precisely which data fields to expose. The Corvil screenshot below shows some of the data-fields we used in this Sqrrl test-drive example:
As you can see, the data includes details of messages being exchanged to create, update and cancel orders, as well as fills representing completed trades. We are modeling a Europe-based brokerage environment here, where it’s common to see securities trading in multiple different currencies. (Real trading data is of course confidential - the data set used here was constructed synthetically.)
One interesting aspect of the data set we used is that it contains many Reject messages, where brokerage systems declined requests by clients to update or cancel orders. What impact do these Rejects have on trading, and are there any particular systems or client firms implicated? To analyze these questions in Sqrrl Enterprise we created a simple linked data model representing orders, trades, traders, securities and gateways, and how they are related via the identifiers present in the raw messages. On ingesting the data into Sqrrl, it quickly became apparent that trades were occurring against orders that clients were trying unsuccessfully to cancel. In the screenshot below you can see the orders that were identified as most problematic, in terms of failed cancel attempts. The chart on the right below shows elapsed times between cancel attempts and subsequent trades (in milliseconds on the x-axis) - as long as several seconds in some cases.
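The underlying check, finding fills that arrive after a failed cancel attempt and measuring the gap, can be sketched in a few lines of Python. This is a simplified illustration rather than the analysis Sqrrl actually performs: the message layout (`ts_ms`, `order_id`, `type`) and the type names are assumptions, and the sample messages are invented.

```python
def fills_after_failed_cancel(messages, min_elapsed_ms=0):
    """Return (order_id, elapsed_ms) for fills that follow a rejected cancel.

    elapsed_ms is measured from the first cancel request that was rejected
    to each later fill on the same order.
    """
    pending_cancel = {}  # order_id -> time of the latest cancel request
    failed_cancel = {}   # order_id -> time of the first rejected cancel request
    results = []
    for msg in sorted(messages, key=lambda m: m["ts_ms"]):
        oid, kind, ts = msg["order_id"], msg["type"], msg["ts_ms"]
        if kind == "cancel_request":
            pending_cancel[oid] = ts
        elif kind == "cancel_reject" and oid in pending_cancel:
            failed_cancel.setdefault(oid, pending_cancel[oid])
        elif kind == "fill" and oid in failed_cancel:
            elapsed = ts - failed_cancel[oid]
            if elapsed > min_elapsed_ms:
                results.append((oid, elapsed))
    return results


# Made-up messages: O1 and O2 have rejected cancels, O3 was never cancelled.
messages = [
    {"ts_ms": 1000, "order_id": "O1", "type": "cancel_request"},
    {"ts_ms": 1002, "order_id": "O1", "type": "cancel_reject"},
    {"ts_ms": 7500, "order_id": "O1", "type": "fill"},
    {"ts_ms": 2000, "order_id": "O2", "type": "cancel_request"},
    {"ts_ms": 2001, "order_id": "O2", "type": "cancel_reject"},
    {"ts_ms": 2300, "order_id": "O2", "type": "fill"},
    {"ts_ms": 3000, "order_id": "O3", "type": "fill"},
]
all_hits = fills_after_failed_cancel(messages)
late_hits = fills_after_failed_cancel(messages, min_elapsed_ms=5000)
print(all_hits)   # [('O2', 300), ('O1', 6500)]
print(late_hits)  # [('O1', 6500)]
```

Raising `min_elapsed_ms` narrows the results to the cases where the trade came long after the cancel attempt, which are the ones most worth investigating.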
Sqrrl lets you click through from the dashboard charts shown above to identify relationships for the particular orders involved. In the screenshot below we are looking at the trader (pmasters5), the gateway (simply ‘FixGateway’ in this example data) and the security associated with one of the worst-offending orders. Other details for this order (including the total value traded against it, in CHF) are shown in the box on the right:
Users can also access the original raw data for all of the associated messages, which is retained in Sqrrl’s Accumulo-based data store. Interestingly, Accumulo is a NoSQL store that provides cell-level security, which means that the details visible to different users in this screen can be precisely controlled by policy - even when the underlying data structures are dynamic.
Where Sqrrl becomes truly powerful is in its ability to run general queries against the dataset and get results back in the form of graphs, making it easy to grasp the relevant relationships for all entities in the result set. For example, here’s how we would run a query to find all orders that traded more than 5 seconds after clients attempted to cancel them:
And here are the query results:
From the graph, several things are immediately obvious. While each of the affected orders (pink nodes) is for a different security, all of them were from the same trader and passed through the same gateway (blue node), and all were denominated in CHF. With this information we can identify a particular processing pathway on which to focus our investigation - in fact it turned out that this gateway was arbitrarily delaying certain execution reports so that clients were not being informed of trades until seconds after they occurred.
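The graph makes these shared attributes visible at a glance, but the same observation can be expressed programmatically. The sketch below is a hypothetical Python helper, not part of Sqrrl or Corvil, that reports which attributes are common to every order in a result set. The trader, gateway and currency values mirror the example above, while the order IDs and security names are placeholders.

```python
def shared_attributes(rows, fields):
    """Return {field: value} for each field whose value is identical in all rows."""
    shared = {}
    for field in fields:
        values = {row[field] for row in rows}
        if len(values) == 1:
            shared[field] = values.pop()
    return shared


# Placeholder result set: order IDs and securities are invented.
result_set = [
    {"order_id": "O1", "security": "SEC_A", "trader": "pmasters5",
     "gateway": "FixGateway", "currency": "CHF"},
    {"order_id": "O2", "security": "SEC_B", "trader": "pmasters5",
     "gateway": "FixGateway", "currency": "CHF"},
    {"order_id": "O3", "security": "SEC_C", "trader": "pmasters5",
     "gateway": "FixGateway", "currency": "CHF"},
]
common = shared_attributes(result_set, ["security", "trader", "gateway", "currency"])
print(common)
# {'trader': 'pmasters5', 'gateway': 'FixGateway', 'currency': 'CHF'}
```

Attributes that vary across the result set, such as the security here, drop out, leaving only the factors that point to a common processing pathway.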
This short example is just a flavor of the type of analysis that can be performed using Corvil’s Streaming Analytics in Sqrrl. A lot more information about Sqrrl’s technology and product can be found at sqrrl.com (plus a downloadable product demo), while details about Corvil’s open Analytics Streams API can be found at this link. Sqrrl exemplifies a new breed of Big Data analysis solutions that are giving practitioners more powerful ways to work with very large data sets. Our perspective here at Corvil is that getting the right data into the right storage, analysis and data visualization systems is critical for making the most of the latest innovations in data analysis. Our focus is on making that as easy as possible for our users.