Ask a network engineer to sketch their enterprise architecture and you will almost always get something like this:
A classic enterprise stack of mirrored switches that ensures the flow of data in the event of link or device failure. It shows that network engineers think along parallel lines and understand the need for resilience. But, in my experience, good monitoring and testing capabilities are often omitted from these designs. This may be the result of budget or time constraints, and the network architects I know are generally comfortable with this ‘build first and test later’ approach. As this is a proven design that works well on paper, the approach is understandable. But as the old saying goes:
“There is many a slip between cup and lip.”
The challenge comes when the implementation meets the real world of different hardware vendors, legacy infrastructure and distributed applications that introduce special constraints and exceptions. Over time, build, configuration and maintenance activity weakens the resilience of the design as application changes and exceptions accumulate. We have seen many deployments where services are deployed to favour the primary side for reasons of performance or security. The lack of visibility into application performance across the network then becomes a serious problem. Any switchover resulting from a device failure can affect application performance in ways that might take weeks to analyse completely, leaving management decidedly worried about the health of their business-critical applications. There is clearly a false economy in building a redundant system without the proper oversight of real-time monitoring.
There is a useful practice of periodically applying a controlled switchover to test resilience. This is a good approach and does at least confirm the availability of redundant services. These tests are normally carried out during a maintenance window at weekends when most applications are idle. Executing a switchover during a busy trading session would be like setting fire to your house to test the sprinkler system.
Now, imagine your design is complete with data aggregation, real-time data capture and context-aware business analytics. The operational benefits you get are clear:
· Always-on passive monitoring
· Forensic root-cause analysis for all network and application events
· What-if scenarios to model potential failures based on actual observed traffic
· Capacity planning based on real traffic observation
A resilient network that is monitored in real time is essential for sound operational reasons, but increasingly we see the financial regulator becoming an equally important driver.
A case in point is the new set of technical regulations in financial markets, MiFID II. The requirements clearly state the need to report complete and accurate details of executed transactions. The emphasis is on real-time analysis with provable business continuity. It is clear the EU regulator is playing catch-up with the algorithmic trading platforms, which have been handling trades in sub-millisecond time-frames for several years now. Essentially, the MiFID II regulations put the onus on trading firms to accurately reconstruct any trade sequence with precision timestamps to show the flow of pricing information, orders, cancellations and trades executed.
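To make the reconstruction requirement concrete, here is a minimal sketch of replaying a trade lifecycle from timestamped capture data. The event fields, event types and the `reconstruct_sequence` helper are illustrative assumptions, not taken from any MiFID II schema or vendor product; the point is simply that precision timestamps let you restore the true ordering even when events are captured out of order.

```python
from dataclasses import dataclass

# Hypothetical captured event record -- field names and event types
# are illustrative only, not a MiFID II data model.
@dataclass(frozen=True)
class TradeEvent:
    order_id: str
    event_type: str      # e.g. "quote", "order", "cancel", "execution"
    timestamp_ns: int    # precision timestamp, nanoseconds since epoch

def reconstruct_sequence(events):
    """Group captured events per order, sorted by precision timestamp,
    so each trade's full lifecycle can be replayed in true order."""
    by_order = {}
    for ev in sorted(events, key=lambda e: e.timestamp_ns):
        by_order.setdefault(ev.order_id, []).append(ev)
    return by_order

# Events arriving out of capture order still replay correctly once sorted.
captured = [
    TradeEvent("A1", "execution", 1_000_000_300),
    TradeEvent("A1", "order",     1_000_000_100),
    TradeEvent("A1", "quote",     1_000_000_000),
]
lifecycle = [e.event_type for e in reconstruct_sequence(captured)["A1"]]
# lifecycle is ["quote", "order", "execution"]
```

In a real deployment the timestamps would come from hardware capture cards synchronised to a traceable clock, which is what makes the reconstructed ordering defensible to a regulator.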
A prime broker we work with had the unfortunate experience of a network switch browning out. Now, any network engineer will tell you that a total device failure is preferable to intermittent glitches: a device going down is easily detected, and the fail-over correction is (or should be!) near instantaneous. In this case, however, the exchange link spluttered on, where some trades completed but a larger, unknown quantity of orders was not captured. The net result for this broker, once the markets had closed, was a value-at-risk exposure in the region of €100m. It took several weeks of intensive log-data analysis, sharing of drop-copy feeds and negotiations with counterparties to resolve this exposure.
Now imagine for a moment this same network had a comprehensive data capture system with real-time trading analytics. You would not only get early notification of the network glitches impacting your trade flow, but the transparent data capture during the switchover event would fully address the concerns of both the risk and regulatory compliance officers. So, get yourself some R&R, you know it’s good for you.