The age-old question of whether to use TCP or UDP for message-exchange is one that a client of ours tackled again recently. Of course, “age-old” is perhaps an exaggeration, given that the protocols are younger than I am, but at the heart of the question our client faced is the tension between performance and reliability, which is indeed timeless.
I came across an intriguing example of this recently: at least part of the success of the Roman army was due to its excellent system of communications, and speed was essential to this. As well as their courier system (the cursus publicus), they experimented with other, faster methods, including the “torch and water barrel” method. According to Christopher Sterling, this was a relay method based on a code transmitted over visual distances via torches. Each sending and receiving station was equipped with a common numbered list of messages, and a barrel of water with a numbered gauge inside. To send a message, both sender and receiver would light torches and, once suitably coordinated, the sender would raise his torch at the same time as removing a plug from the bottom of his barrel to let the water run out. Once the water level dropped to the mark on the gauge corresponding to the number of the message to be sent, he would lower his torch again. At the far end, once the receiver saw the sender's torch raised, he would remove the plug from his own barrel, and replace it as soon as he saw the sender lower his torch. He could then check the level the water had dropped to inside his barrel, read off the message number, and decipher the message.
This method had the advantage of being very fast, taking at most a few minutes to send a reasonable number of bits over a long distance. Unfortunately, despite the ingenuity of the method, it was “notoriously inaccurate in practice” due to the difficulty of calibrating the barrels to leak at a similar enough rate to disambiguate the message numbers. They generally had to fall back to written messages carried by courier on horseback.
Back in modern times in the datacenter, our client was fortunate to be working with far superior technology and, rather than having to choose between “horseback rider” and “torch and barrel” for sending messages in their distributed application, had the more enviable choice between TCP and UDP. Nevertheless, they still had to make the choice, with the tension between speed and reliability at the heart of it. This tension manifested itself in conflicting views between an application development team and a network operations team. The reason our clients shared the story with us was that they used their Corvil deployment to resolve the conflict, as we’ll see below, as well as incorporating it into their ongoing DevOps process.
The goals of the application led the team to focus on the speed at which it could publish messages between its distributed components. A key decision for the application architects was whether to use TCP or UDP for transporting these messages, and a key metric in making this decision was the messaging delay: the length of time from when one component sent a message to when the target component received it. Given the prevailing wisdom that UDP is faster than TCP, the application team were inclined to plump for the former, but ran some experiments to investigate.
The infrastructure supported a solid implementation of PTP (Precision Time Protocol), so the application developers were able to get good measurements of the component-to-component message delays and compare the results under both TCP and UDP. Initially the results were not very different for the two protocols: over TCP, the application achieved a median transport time of about 16us, while over UDP it achieved just under 12us - a little faster, but not dramatically so. The outliers were smaller using UDP too, with the 99th percentile under 12.5us, as compared to over 33.5us over TCP.
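Delay statistics like these are straightforward to compute once each message carries a PTP-synchronised send timestamp. The sketch below (illustrative names and sample values, not the client's actual data) summarises one-way delays with a nearest-rank percentile:

```python
import math

def delay_percentile(delays_us, pct):
    """Nearest-rank percentile of a list of one-way delays (microseconds)."""
    ranked = sorted(delays_us)
    k = max(1, math.ceil(pct / 100 * len(ranked)))  # 1-indexed rank
    return ranked[k - 1]

# Hypothetical per-message delays: receive_timestamp - send_timestamp.
delays = [11.7, 11.8, 11.9, 12.0, 12.0, 12.1, 12.2, 12.3, 12.4, 30.5]
median_us = delay_percentile(delays, 50)  # the typical case
p99_us = delay_percentile(delays, 99)     # the outliers that matter
```

Looking at a high percentile as well as the median is what surfaces the outlier behaviour discussed here; the median alone would have hidden it.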
The interesting part of the initiative came when they started to do some stress-testing. The initial results were collected under low load, with message-rates of about 5,000 messages per second (5K msgs/sec) per thread. As they changed the profile of the application behaviour, introducing bursts of 25K msgs/sec, and then up to 250K msgs/sec, the results changed dramatically. First of all, the components using UDP started to see large numbers of messages go missing - they were sent, but never arrived. Knowing that UDP is an unreliable protocol, and citing the fact that the TCP-based version never lost messages, the application team pointed the finger squarely at the network.
Fig. 1: Higher message-rates leading to higher TCP latency, and spikes of data loss over UDP.
As a result, the network team brought Corvil into the conversation. A configuration change on their packet-broker was all that was needed to add the network segment in question to the scope of their Corvil deployment. They used Corvil to analyse the traffic exchanged by the application, which quickly confirmed that no packets were being dropped by the network. To confirm this at the application layer, they used the Corvil SDK to create a plug-in for Corvil Analytics to decode the messages in-flight on the network. The Corvil user was able to show the application team that all their messages were indeed intact on the network. For good measure, they also deployed Corvil Sensor on the machine hosting the recipient application components. This allowed them to confirm that all messages were being successfully received by the operating system, ruling out any drops in the NIC (network interface card).
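A quick complementary check on the receiving host itself (Linux-specific, independent of the Corvil tooling, and only a sketch) is the kernel's own UDP drop counters in /proc/net/snmp: the RcvbufErrors field counts datagrams discarded because a socket's receive buffer was full, which is precisely the failure mode that turned out to be in play here.

```python
def udp_rcvbuf_errors(snmp_text):
    """Parse the two 'Udp:' lines of /proc/net/snmp and return the
    RcvbufErrors counter (datagrams dropped for want of buffer space)."""
    lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = lines[0], lines[1]  # first line names, second line counts
    stats = dict(zip(header[1:], map(int, values[1:])))
    return stats.get("RcvbufErrors", 0)

# Typical use on a Linux receiver:
#   udp_rcvbuf_errors(open("/proc/net/snmp").read())
```

If this counter climbs while the application reports loss, the datagrams reached the host and died in the socket buffer, not on the wire.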
This evidence persuaded the application team to look more closely at their code, and they realized that they were doing message processing in the same thread that was receiving the data from the UDP socket. It turned out that their application was not able to process all the messages at the rate at which the sender was bursting them. After a couple of days' work, they created an experimental version of the application that moved the heavyweight message processing into a separate application thread. This new version worked much better, and only rarely dropped messages.
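The shape of that fix can be sketched as follows (a minimal illustration, assuming a Python-style receiver; the client's actual code is not shown in this post): one thread drains the UDP socket as fast as possible and hands messages to a worker via a bounded queue, so heavyweight processing no longer backs up the socket's receive buffer.

```python
import queue
import threading

# Bounded queue: if processing falls behind, drops become an explicit,
# application-level choice instead of silent kernel buffer overflows.
inbox = queue.Queue(maxsize=100_000)

def receive_loop(sock):
    """Fast path: read datagrams and enqueue them; no heavy work here."""
    while True:
        data, _addr = sock.recvfrom(65535)
        try:
            inbox.put_nowait(data)
        except queue.Full:
            pass  # the application, not the kernel, now decides what to drop

def worker_loop(handle_message):
    """Slow path: heavyweight per-message processing on its own thread."""
    while True:
        handle_message(inbox.get())

# Wiring it up would look something like:
#   threading.Thread(target=receive_loop, args=(sock,), daemon=True).start()
#   threading.Thread(target=worker_loop, args=(process,), daemon=True).start()
```

The receive thread's only job is to keep the socket buffer empty; everything expensive happens behind the queue.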
With the message loss resolved, they returned to the question of transmission delay. This time, they found a dramatic difference between TCP and UDP: the median delay over TCP was over 110ms, orders of magnitude larger than at low message rates. In contrast, the delay over UDP was rock solid at only 13us. Again the application team were inclined to blame the network, suspecting that TCP congestion control was unnecessarily throttling the application communications. Using Corvil to look more closely, they found that the lion’s share of the delay had already accumulated by the time the messages reached the network.
It did look like the sending application was bottlenecked in transmitting messages onto the network. Apparently this triggered an interesting discussion between the application and network teams on whether the network MTU was appropriately set, but this proved to be a moot point: Corvil Analytics also highlighted zero-window advertisements coming from the receiving application's TCP socket. Window advertisements are the mechanism by which TCP ensures that a fast sender can't overwhelm the buffers in the receiver, and a zero-window means that the receiver is completely stalled. These showed up exactly when the message delays spiked, and were the proverbial smoking gun: the Corvil analysis showed that the TCP stack was buffering the messages because the receiver was not able to process them as fast as they were arriving.
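The mechanism is easy to see in isolation with a small localhost experiment (a sketch, not the client's setup): shrink both socket buffers, have the receiver read nothing, and count how much a non-blocking sender can push before TCP's flow control stops it.

```python
import socket

def stalled_receiver_capacity(bufsize=4096):
    """Bytes a sender can transmit to a receiver that never reads:
    roughly one send buffer plus one receive buffer, then the window closes."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set SO_RCVBUF before listen() so the accepted socket inherits it.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    snd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    snd.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    snd.connect(srv.getsockname())
    rcv, _ = srv.accept()
    snd.setblocking(False)
    sent = 0
    try:
        while True:
            sent += snd.send(b"x" * 1024)  # the receiver never calls recv()...
    except BlockingIOError:
        pass  # ...so its window closes and the sender can go no further
    for s in (snd, rcv, srv):
        s.close()
    return sent
```

A blocking sender would simply stall inside send() at the same point, which is exactly where the delay in this case was accumulating: before the messages ever reached the wire.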
Fig. 2: Corvil dashboard built by the application development team, showing the full end-to-end profile of message transmission delay for both TCP and UDP. The periods of high delay over TCP match exactly the periods where the application cannot keep up with the message transmission rate. The TCP stack on the receiver is forced to advertise that its receive window is full and, in turn, the sender must buffer the messages, resulting in the large transport delays. The UDP version of the application is not affected by such behaviour and maintains a flat and low message-delay profile.
Again, the underlying cause was the same as that driving the original loss over UDP. In this case, our client decided to make no changes to the application behaviour when using TCP. Once they understood that a genuine application resource issue, and not a network issue, was driving the slowdown in message transmission, they were happy to leave TCP to handle the buffering that they had to do manually over UDP. Furthermore, knowing that UDP held only a moderate speed advantage over TCP under normal operational conditions, they ultimately decided that the extra reliability TCP brings was worth it. Despite a desire to eliminate as much delay as possible from their messaging, they realised, like the Romans with their torch-and-barrel method, that the integrity of the messaging is too important to compromise.
For me, the most gratifying outcome of this exercise was the decision the application team took to incorporate Corvil Analytics into their operations. Not only have they committed to leveraging our analytics during future application development cycles, they are designing Corvil appliances into their deployment plans for the application. Already they have started by using the Corvil Splunk App to stream the Corvil analytics into the Splunk instance that indexes all of their application logs. The application team has moved from an initial position of distrusting the network, to better understanding how it works and ultimately valuing the visibility that the network can provide for application design, optimization, and service assurance.