Summary
A business-critical network relay server for a VPN service was experiencing disastrously poor performance while serving fewer clients and less bandwidth than similar healthy servers. Network traffic was triple-encrypted, making straightforward traffic analysis impossible. I analyzed the number, size, general type, and pattern of network packets to find and implement a temporary workaround, followed by a full root-cause analysis and several proposed permanent solutions.
Details
For several months, one of several dozen network relay servers was experiencing poor performance and high CPU and memory usage while serving fewer clients and less bandwidth than other healthy servers. Occasionally the relay server would run out of memory, crash, and restart, at which point the performance problems would return. Multiple clients were reporting high latency and low throughput.
A quick review of socket statistics with ss showed that client TCP connections to the relay server had large amounts of received TCP data built up in kernel buffers, waiting for the relay server application to process them - which it could not do because it was CPU-bound. Something about the traffic hitting this particular relay server was causing higher CPU usage. Using perf to gather CPU usage data did not find any smoking guns beyond the obvious fact that the system was CPU-bound and unable to process network data efficiently.
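For readers unfamiliar with this kind of check, the sketch below shows roughly what it looks like; it is not the exact tooling or threshold used in this investigation. It lists established TCP sockets via ss and flags any whose kernel receive queue (Recv-Q) has built up, which is the signature of an application that cannot drain data as fast as it arrives.

```python
# Minimal sketch: flag TCP sockets with a backed-up kernel receive queue.
# The 64 KiB threshold is an arbitrary illustration, not a real cutoff.
import subprocess

THRESHOLD = 64 * 1024  # bytes of un-read data considered "backed up"

def backed_up_sockets():
    out = subprocess.run(
        ["ss", "-t", "-n", "state", "established"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    flagged = []
    for line in out[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        recv_q = int(fields[0])        # Recv-Q: bytes waiting for the app
        local, peer = fields[2], fields[3]
        if recv_q > THRESHOLD:
            flagged.append((recv_q, local, peer))
    return sorted(flagged, reverse=True)

if __name__ == "__main__":
    for recv_q, local, peer in backed_up_sockets():
        print(f"{recv_q:>10} bytes queued  {peer} -> {local}")
```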
Unfortunately, the network traffic itself was triple-encrypted, with the ultimate content encrypted by a key we did not possess. We could not analyze the content of the packets, only external characteristics such as the size of the packets, their rate, the total numbers of each type of packet, and their source IP addresses.
To give us time to analyze the root cause, I tried several techniques to spread out the traffic to other relay servers so clients would have more acceptable performance. However, attempts to redistribute the traffic over several servers failed - whatever we did, the high resource usage ended up concentrated on a single server. Our service could not operate without functioning relay servers, and this relay server served a large fraction of client nodes in one of the world’s largest cities, so this was a serious threat to our operations.
At this point, I realized that we were close to a tipping point at which a significant percentage of our clients would be unable to use our product, due to a problem we did not have a solution for despite several weeks of work. I decided this was a business-critical emergency and started working on this problem full-time.
I analyzed the size of the packets and found that the majority were small (~100-byte) packets. Looking at relay server metrics that tracked general types of packets, I determined they were a specific type of control packet in our protocol. I also realized that servers in other locations were experiencing the same traffic pattern at lower volumes, not yet enough to affect performance. Now that I had multiple affected servers to look at, the pattern of which servers were affected showed that the problem traffic was coming from a single organization using our software. I used the source IP addresses to identify the specific client, which had several thousand nodes using our service.
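The actual analysis came from relay server metrics and packet metadata, but as an illustration, this is roughly what a metadata-only pass over a capture looks like: bucket packet sizes and count packets per source address. The scapy library, capture file name, and bucket width here are assumptions for the sketch, not what was used.

```python
# Sketch: histogram of packet sizes plus per-source packet counts from a
# capture, using only external characteristics (no payload inspection).
from collections import Counter

from scapy.all import IP, rdpcap  # third-party: pip install scapy

def summarize(pcap_path: str, bucket: int = 50):
    size_hist: Counter = Counter()
    sources: Counter = Counter()
    for pkt in rdpcap(pcap_path):
        size_hist[(len(pkt) // bucket) * bucket] += 1
        if IP in pkt:
            sources[pkt[IP].src] += 1
    return size_hist, sources

if __name__ == "__main__":
    # "relay-sample.pcap" is a placeholder capture file name.
    sizes, sources = summarize("relay-sample.pcap")
    print("packet sizes (50-byte buckets):")
    for low in sorted(sizes):
        print(f"  {low:4d}-{low + 49:4d} bytes: {sizes[low]}")
    print("top source addresses:")
    for src, count in sources.most_common(10):
        print(f"  {src}: {count}")
```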
I was now able to implement a workaround that shifted this client to its own server in the most highly loaded region, which removed the threat to our overall user base and gave us time to find the root cause. Strangely, that client never noticed any performance problems, despite the new server continuing to be overloaded.
Looking at the more detailed metrics for this type of control packet, I found that nearly all the packets were of one specific subtype: a packet that was sent only at the establishment of a new connection between client nodes. This packet asked the relay server to facilitate a connection to the target node if the source node could not reach it directly.
Ironically, that packet was sent simultaneously with a packet that attempted to connect directly to the target node using its last known location. This direct packet almost always succeeded on this client’s network, and all future client-to-client communication skipped the relay server.
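To make the sequencing concrete, here is a purely illustrative sketch of that pattern; the message names, transport, and function signature are invented stand-ins, not our actual protocol.

```python
# Illustrative only: on first contact, probe the peer directly and, at the
# same time, ask the relay to facilitate the connection as a fallback.
import socket

def connect_to_peer(node_id: bytes, peer_id: bytes,
                    peer_last_known: tuple[str, int],
                    relay_addr: tuple[str, int]) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Optimistic path: probe the peer at its last known address. On this
    # client's network this almost always worked, so the relay never had
    # to do anything further for the pair.
    sock.sendto(b"DIRECT-PROBE " + node_id, peer_last_known)
    # Fallback path, sent simultaneously: ask the relay to help reach the
    # peer in case the direct probe never gets through.
    sock.sendto(b"RELAY-ASSIST " + node_id + b" " + peer_id, relay_addr)
    return sock
```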
This solved one mystery: the client whose traffic caused the performance problem was not experiencing any performance problems themselves because they didn’t need the relay server to do anything for them. However, other clients did sometimes need the relay server’s help to establish connections or send data, which was why they noticed performance problems when they used the same server as this client.
I did some math and found that each client node was sending a small client data packet to a few other client nodes every ~2 minutes. However, our client-to-client connection timeout was less than two minutes, so each of these client data packets required establishing a new connection, which triggered a connection establishment packet to the relay server.
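The back-of-the-envelope arithmetic looks roughly like this; every number below is an illustrative placeholder rather than the client's real configuration.

```python
# Illustrative arithmetic only; the real node count, fan-out, and timeout differed.
NODES = 3000          # "several thousand" client nodes
FANOUT = 3            # peers each node sends a small data packet to
INTERVAL_S = 120      # ~2 minutes between packets to a given peer
IDLE_TIMEOUT_S = 110  # hypothetical client-to-client connection timeout

# The idle timeout is shorter than the send interval, so every data packet
# finds its previous connection expired and triggers a fresh connection
# establishment packet to the relay server.
every_send_needs_setup = INTERVAL_S > IDLE_TIMEOUT_S

setup_packets_per_sec = NODES * FANOUT / INTERVAL_S if every_send_needs_setup else 0
print(f"every send needs a new connection: {every_send_needs_setup}")
print(f"connection establishment packets hitting the relay: ~{setup_packets_per_sec:.0f}/s")
```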
This is how we discovered that the relay server process was severely limited in the rate of discrete messages it could process at the application level. These tiny control packets used almost no bandwidth, but a lot of CPU. The kernel could process the TCP packets they arrived in several orders of magnitude faster than the application could, hence the full kernel TCP receive buffers.
I asked the client if they had installed a piece of software that made each client node send a small amount of data to a few randomly selected other client nodes approximately every two minutes, with rollout starting on a specific date. They replied that they had indeed installed such software around that date, using the default configuration. I looked up the default configuration, did some more math, and found that it almost exactly matched the number of packets I was seeing.
The problem for us was that the client software sent a number of packets that increased quadratically with the number of client nodes. As long as the packets were sent close enough together to keep our VPN connection alive, this was not a problem, because they would be sent directly from client to client. But because the interval between packets exchanged by any two clients was longer than our connection timeout, every packet required a connection initiation packet to be sent to the relay server.
This is a very reasonable client traffic pattern from a very popular software package that only became a problem when every packet created another packet that had to be processed by a single server. Improving the efficiency of the relay server at processing connection initiation packets was hopeless: their number increased with the square of the number of client nodes, so any improvements we made would be swiftly overwhelmed when the client added a few dozen more nodes.
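A toy calculation under that quadratic-growth observation makes the point; the baseline fleet size and constant factor are arbitrary, only the n-squared shape matters.

```python
# Toy illustration: relay load relative to a 1000-node fleet, assuming
# load grows with the square of the node count as described above.
for nodes in (1000, 2000, 4000, 8000):
    print(f"{nodes:5d} nodes -> {(nodes / 1000) ** 2:4.0f}x the relay load")
```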
The long-term solution was to reduce the frequency of connection initiation packets, either by changing the connection timeout, by attempting to connect directly to the client node before requesting help from the relay server, or by removing the need for the connection initiation packet entirely. When I left the organization, discussion was still ongoing about which of these options to use, perhaps showing the danger of introducing a workaround that reduced the urgency of solving the problem.
Have a systems problem you need help with? Schedule a free consultation or read about more systems problems I have solved.