Hardware and software corruption in a supercomputing cluster

Summary

A large supercomputing cluster produced the wrong output every time it ran a standard benchmark. I found two causes of silent memory corruption: a use-after-free error in the in-kernel distributed file system, and pervasive memory errors caused by faulty RAM.

Details

A university was doing quality assurance on a newly delivered supercomputing cluster running Linux and GPFS (General Parallel File System). Every time they ran a widely used, well-tested, standard linear algebra benchmark on the cluster, they got a different answer, and never the right one. Naturally, they were unhappy with the cluster and wanted to return it.

The vendor insisted that the problem was caused by the customer's own added software. After several months of back and forth, the customer issued an ultimatum: produce one run of the benchmark on the entire cluster with the correct answer within one month, or they would send the cluster back. I was called in to find the cause of the benchmark errors.

I got access to the supercomputing cluster and a collection of kernel crash dumps. Running a shorter version of the benchmark on a smaller number of nodes increased the chances of getting the correct answer, but as the number of operations and nodes involved grew, the likelihood of a wrong answer increased. The pattern was consistent with silent memory corruption.
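
To see why scale alone would push the failure rate toward certainty, here is a back-of-the-envelope model in Python. The per-node corruption probability is entirely made up for illustration; the point is only that a small per-node chance of corruption compounds across hundreds of nodes.

    # Toy model (assumed numbers, not measurements): each node independently
    # corrupts the result with a small fixed probability during a run.
    def p_wrong_answer(per_node_corruption_prob, num_nodes):
        """Probability that at least one node corrupts the benchmark result."""
        return 1.0 - (1.0 - per_node_corruption_prob) ** num_nodes

    # Even a tiny per-node probability approaches certainty at cluster scale.
    for nodes in (1, 10, 100, 1000):
        print(nodes, round(p_wrong_answer(0.005, nodes), 4))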

One set of kernel crash dumps showed a pattern of memory corruption that looked like a bug in kernel code: parts of memory were overwritten with values consistent with kernel addresses. I used kdb on the crash dump to find where the address of the overwritten memory was stored elsewhere in memory, and that led me to a use-after-free bug in GPFS. The bug had already been fixed in a later release of the software, and upgrading to the most recent version eliminated this cause of silent memory corruption. However, the benchmark still produced wrong answers more often than not.
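
For readers who haven't done this kind of crash-dump work, the core trick is simply to search memory for words that hold the address of the corrupted region; whatever still references that address is a good candidate for the structure that kept using memory after freeing it. kdb has its own commands for this; the sketch below, with a made-up dump file name and address, only illustrates the idea on a hypothetical raw dump.

    import struct

    def find_pointer_references(dump_path, target_addr):
        """Scan a raw 64-bit little-endian memory dump for words equal to target_addr."""
        needle = struct.pack("<Q", target_addr)
        with open(dump_path, "rb") as f:
            data = f.read()
        offsets = []
        pos = data.find(needle)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(needle, pos + 1)
        return offsets

    # Hypothetical usage: who still holds a pointer to the overwritten memory?
    print(find_pointer_references("vmcore.raw", 0xFFFF8800DEADBEEF))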

The other crash dumps were triggered by MCEs (Machine Check Exceptions). I noticed that, before each MCE, the system logs contained an extremely high number of kernel messages about correcting single-bit errors in RAM. The sales team insisted that this was unrelated to silent memory corruption because the single-bit memory errors were corrected, and that double-bit errors couldn't cause corruption either because they triggered an MCE and a kernel crash.

I have never, before or since, seen so many errors per byte of RAM. I learned that the DIMMs installed in the cluster came from the hardware vendor's first production run, were the largest DIMMs on the market at the time, and had not been widely tested. ECC of the usual SECDED variety corrects single-bit errors and detects double-bit errors, but an error that flips three or more bits in a word can slip past it entirely. Given the rate of corrected errors and of detected but uncorrected errors, errors that were both uncorrected and undetected were statistically very likely. My conclusion was that the remaining silent memory corruption came from undetected errors in RAM.
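
To make the statistical argument concrete, here is a rough model built entirely on my own assumptions: bit flips within a 64-bit ECC word are independent, one flipped bit is corrected, two are detected, and three or more can escape both. The per-bit flip probability below is invented; in practice you would estimate it from the observed corrected-error rate in the logs.

    from math import comb

    def word_error_probs(p_bit, bits_per_word=64):
        """Per-word probabilities of corrected (1 flip), detected (2 flips),
        and potentially silent (3 or more flips) errors, assuming independent flips."""
        def k_flips(k):
            return comb(bits_per_word, k) * p_bit**k * (1 - p_bit) ** (bits_per_word - k)
        corrected = k_flips(1)
        detected = k_flips(2)
        silent = 1 - sum(k_flips(k) for k in range(3))
        return corrected, detected, silent

    # Invented per-bit probability; across the billions of words of RAM in a
    # large cluster, even a tiny per-word silent-error probability adds up.
    corrected, detected, silent = word_error_probs(1e-4)
    print(f"corrected={corrected:.3e} detected={detected:.3e} silent={silent:.3e}")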

I don’t know the end of this story. Replacing the DIMMs would have cost millions of dollars and the sales team wasn’t willing to try it before my engagement ended, so I couldn’t confirm my diagnosis. But the two weeks I spent analyzing crash dumps and system logs remains one of my favorite memories.

Like this story? Read more stories about solving systems problems.