Fixing the network performance of a Cray with a Linux TCP proxy


In 2001, a customer had a Cray that needed to transfer large amounts of data to other computers. After they upgraded part of their network to theoretically faster hardware, the Cray's effective network transfer bandwidth dropped precipitously. I built an SMP Linux network appliance with a userland network proxy that let the Cray transfer data over the new network hardware at close to maximum bandwidth.


A customer needed to transfer data over the network from a Cray to other machines. Initially, all of the machines used an interconnect called HiPPI. HiPPI was a specialty interconnect originally invented to connect supercomputers at Los Alamos National Laboratory (LANL) so that they could simulate nuclear weapons explosions faster. It had a maximum bandwidth of 800 Mb/s and a maximum packet size of 64KB.

However, as of 2000, Gigabit Ethernet was replacing existing HiPPI networks, with a higher maximum bandwidth of 1 Gb/s, a maximum packet size of 9KB, and much lower cost. LANL was no longer using HiPPI in new supercomputer plans, and other HiPPI users were migrating to Gigabit Ethernet as quickly as possible. I was in the unfortunate position of working for a hardware manufacturer that only made HiPPI hardware.

This customer replaced their HiPPI network with Gigabit Ethernet and was surprised to find that the Cray’s effective network bandwidth had decreased significantly compared to the HiPPI network. After some experimentation, they discovered that the Cray was rate-limited in how many network packets it could process per second. When using HiPPI with 64KB packets, they could saturate the network bandwidth. But even after tuning the Gigabit Ethernet packet size from 1500 bytes to 9KB, the Cray still couldn’t transfer data at a high enough bandwidth for their needs. They tried connecting the Cray’s HiPPI interface to a hybrid HiPPI/Gigabit Ethernet router, and found that path MTU discovery negotiated the packet size down to 9KB on the Cray.

At this point, the customer came to my company and asked if we would write custom Linux kernel code to break up and coalesce the network packets for the Cray, and run it on a Linux appliance that would serve as a gateway to the Cray. This struck me as a singularly difficult and unreliable way to accomplish the goal, so I proposed using ipchains, its TCP proxy feature, and a userland program to copy data from one TCP endpoint to another, allowing two separate MTUs on the HiPPI and Gigabit Ethernet portions of the network. Latency was not important, and a commodity server with 2–4 CPUs could easily saturate both connections with the correct tuning of socket buffer sizes and similar parameters.

During the development of the appliance, I noticed a strange bug. Whenever I closed a proxied TCP socket, the next attempt to connect to the same socket would fail, but the attempt after that would succeed. Netstat showed that the socket was getting stuck in TCP_CLOSING state after the first connection closed.

A socket is usually in TCP_CLOSING state very briefly. It is entered during a simultaneous close, when a socket that has sent its own FIN receives the peer's FIN before the ACK of its own; at that point the socket is simply waiting for that final ACK to finish closing the connection. The next connection attempt to the same socket would fail, but would make the socket exit TCP_CLOSING state. So every other connection attempt would succeed.

What was happening? I spent a lot of time with tcpdump and various other tools and traced the TCP connection closing code through the kernel. The TCP proxy code in Linux 2.2 was implemented as a series of calls to proxy-specific functions in various parts of the TCP logic. I noticed that a function that looked up sockets for TCP over IPv4 did not call the TCP proxy lookup function, but the one for TCP over IPv6 did. We were using IPv4.

I added the missing call in the TCP over IPv4 function and no longer had any trouble with half-closed TCP connections. The patch (my first ever included in the mainline Linux kernel) is below. The finished appliance running the kernel with my bug fix solved the low Cray network bandwidth problem for the customer.

Like this story? Read more stories about solving systems problems.

 static struct sock * tcp_v4_get_sock(struct sk_buff *skb, struct tcphdr *th)
 {
-       return tcp_v4_lookup(skb->nh.iph->saddr, th->source,
-                            skb->nh.iph->daddr, th->dest, skb->dev->ifindex);
+       if (IPCB(skb)->redirport)
+               return tcp_v4_proxy_lookup(th->dest, skb->nh.iph->saddr,
+                                          th->source, skb->nh.iph->daddr,
+                                          skb->dev, IPCB(skb)->redirport,
+                                          skb->dev->ifindex);
+       else
+               return tcp_v4_lookup(skb->nh.iph->saddr, th->source,
+                                    skb->nh.iph->daddr, th->dest,
+                                    skb->dev->ifindex);
 }