Nondeterministic bootloader failure

Summary

After 6 months without active testing or maintenance, the current version of the Linux kernel no longer booted reliably on a custom embedded PowerPC platform. The crash occurred before the kernel could write any output. I traced the problem to the accidental removal of the bootloader code that zeroed the kernel BSS, the part of kernel memory that must be initialized to zero, and I added that code back in.

Details

The manufacturer of a custom embedded PowerPC platform hired me to maintain the Linux distribution for their hardware. My first task was to update the kernel to the current version of Linux. Unfortunately, other PowerPC developers had changed the code for this platform without having any hardware to test it with, so I expected to find bugs.

The first time I booted the updated kernel, it worked perfectly, much to my surprise! I happily recompiled the kernel with a minor cosmetic change. But when I tried to boot the recompiled kernel, it crashed shortly after the bootloader executed it. I rebooted into the first kernel and it also crashed! What happened? Had I just imagined the first successful boot?

Debugging was difficult because the kernel crashed so early in boot that no output routines of any kind were available. After working on the problem for a few hours, I took a break. The embedded platform was on a milspec VME blade and the fan in the VME cage was noisy as hell, so I turned the machine off. When I came back and powered it on, the kernel booted again! Hurray! I quickly rebooted into the same kernel; it crashed again.

I asked myself: what had changed? Well, I had turned off the machine for an hour. And the first time I booted it successfully, I had just turned on the machine for the first time that day. So what changes when the machine is off for an hour or more?

One answer was the memory: once the power is off, the charge slowly leaks out of the cells in the RAM and eventually all of them read as zero. After an hour, the memory was more or less all zeros. But if I successfully booted the kernel and then rebooted right away, the memory would still hold whatever the previous boot had left behind, which to the next kernel is just random garbage. Was the problem that some part of the memory was not being initialized to zero before launching the kernel?

I used the BIOS to zero out the memory where the bootloader wrote the kernel image and then booted the kernel. It worked, every time, even if I had just rebooted.

During the 6 months the code had been edited without testing, someone had broken the part of the bootloader that zeroed out the kernel BSS, the part of its memory that should be initialized to zero when it starts up. Unsurprisingly, the kernel quickly crashed when it expected zeros and got random garbage instead.
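To make the failure mode concrete, here is a minimal user-space sketch in C. It is not the kernel or bootloader code: `fake_bss`, `simulate_cold_boot`, `simulate_warm_reboot`, `clear_bss`, and `kernel_early_init_ok` are all hypothetical names invented for illustration, and the 0xA5 fill pattern simply stands in for leftover data from a previous boot. The sketch assumes early kernel code treats zero as "not initialized yet", which is the usual contract for BSS variables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a few kernel variables that live in the BSS.
 * The kernel never assigns them an initial value; it relies on the
 * bootloader (or startup code) having zeroed this region. */
struct fake_bss {
	int console_ready;           /* 0 means "console not set up yet" */
	unsigned long handler_count; /* 0 means "no handlers registered" */
};

/* RAM after the machine has been powered off for a long time:
 * modeled here as all zeros. */
void simulate_cold_boot(void *ram, size_t len)
{
	memset(ram, 0, len);
}

/* RAM after an immediate reboot: still holds data from the last run.
 * 0xA5 is an arbitrary nonzero pattern chosen for the illustration. */
void simulate_warm_reboot(void *ram, size_t len)
{
	memset(ram, 0xA5, len);
}

/* The step that went missing: zero everything between the start and
 * the end of the BSS before handing control to the kernel. */
void clear_bss(void *bss_start, void *bss_end)
{
	unsigned char *start = bss_start;
	unsigned char *end = bss_end;

	assert(start <= end);
	memset(start, 0, (size_t)(end - start));
}

/* Returns 1 if early kernel code would see the state it expects. */
int kernel_early_init_ok(const struct fake_bss *bss)
{
	return bss->console_ready == 0 && bss->handler_count == 0;
}
```

With this model, a region prepared by `simulate_cold_boot` passes `kernel_early_init_ok` even if `clear_bss` is never called, which matches the misleading first boot. A region prepared by `simulate_warm_reboot` only passes after `clear_bss(&region, &region + 1)`, which matches every boot after that.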

I found where the code to zero out the kernel BSS had been accidentally removed from the bootloader and added it back in (see below). The kernel booted every time after that.


/*
 * prom_init is the Gemini version of prom.c:prom_init.  We only need
 * the BSS clearing code, so I copied that out of prom.c.  This is a
 * lot simpler than hacking prom.c so it will build with Gemini. -VAL
 */

#define PTRRELOC(x)	((typeof(x))((unsigned long)(x) + offset))

unsigned long
prom_init(void)
{
	unsigned long offset = reloc_offset();
	unsigned long phys;
	extern char __bss_start, _end;

	/* First zero the BSS -- use memset_io rather than memset,
	 * since some arches don't have caches on yet */
	memset_io(PTRRELOC(&__bss_start), 0, &_end - &__bss_start);

	/* Default */
	phys = offset + KERNELBASE;

	gemini_prom_init();

	return phys;
}