Data corruption in compact flash

Summary

A client had an embedded device running Linux that used ext3 on a compact flash drive for its root file system. After a few months of usage, the file system would become corrupted. The CF manufacturer insisted that the CF was not at fault because they implemented wear-leveling at the firmware level. I showed that the pattern of corruption could only be explained by corruption inside the CF. Using my analysis, the CF manufacturer was able to reproduce the hardware errors.

Details

A client deployed a new embedded device in the field. After a few months, they noticed that when the device was power cycled, some of them would be unable to boot because the root file system was corrupted. The root file system was ext3 on a compact flash storage device.

The physical flash blocks in compact flash can be written a limited number of times before the block gets worn out and can no longer store data. To extend the life of the CF, it is best to write evenly across all the blocks, a process called wear-leveling.

ext3 does not do wear-leveling; it writes over the same blocks on disk repeatedly, especially those containing the superblock, the inode tables, the bitmaps, the block groups, and the journal blocks. If used directly on bare CF, ext3 will quickly wear out the flash and cause data corruption. However, the CF manufacturer had assured the client that their CF did wear-leveling in firmware and that it was safe to use a journaling file system like ext3 on top.

I was brought on at this point to find the cause of the file system corruption. I was given several images of the corrupted file systems. I wrote a script to visualize the on-disk data patterns in ASCII showing a single character to represent the data in each 512 byte block.
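
Here is a minimal sketch of that kind of visualization pass. It is not the original script; the 512 byte block size and the three-character key come from the description above, while the file handling and the 64 characters per line are illustrative choices.

/*
 * Sketch of the block visualizer: print one character per 512 byte
 * block -- '|' for all 1's, '.' for all 0's, 'x' for a mix.
 */

#include <stdio.h>
#include <stdlib.h>

#define VIS_BLOCK	512
#define VIS_PER_LINE	64

int
main(int argc, char *argv[])
{
	unsigned char buf[VIS_BLOCK];
	size_t n, i, count = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <image>\n", argv[0]);
		exit(1);
	}
	if ((f = fopen(argv[1], "rb")) == NULL) {
		perror(argv[1]);
		exit(1);
	}
	while ((n = fread(buf, 1, VIS_BLOCK, f)) == VIS_BLOCK) {
		int all_ones = 1, all_zeros = 1;

		for (i = 0; i < VIS_BLOCK; i++) {
			if (buf[i] != 0xff)
				all_ones = 0;
			if (buf[i] != 0x00)
				all_zeros = 0;
		}
		putchar(all_ones ? '|' : (all_zeros ? '.' : 'x'));
		if (++count % VIS_PER_LINE == 0)
			putchar('\n');
	}
	putchar('\n');
	fclose(f);
	return 0;
}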

Here is the output for a typical corrupted file system:

Key:

| = all 1’s
. = all 0’s
x = mix of 1’s and 0’s

Data:

|xx.xxx.xxx||||||xxx.........||||...............................
.||||||||............||||....||||....||||....||||........|||||||
|....xx||....||||||||||||....||||....||||||||........||||||||...
................................................................
................................................................
................................................................
.........................................................|||||||
|||||....|||||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||.....................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................

The corrupted blocks in this image were the runs of “|” - that is, 512 byte blocks in which every bit was set to 1. Other images showed corrupted blocks in which the corrupt data was only partially set to 1’s.

The corrupted data most often occurred in blocks of 2048 bytes, beginning at an offset that was aligned to 2048 bytes from the beginning of the disk, plus an additional 512 bytes! These 2048 byte blocks sometimes alternated with uncorrupted or differently corrupted 2048 byte blocks. Here is one example, with the beginning of each line aligned to a 2048 byte boundary plus 512 bytes, showing two different patterns of corruption in alternating 2048 byte blocks (“xxxx” and “||||”):

....xx||
....||||
........
xxxx||||
xxxx....
xxxx....
xxxx....
xxxx..||
xxxx....
xxxx....
xxxx||||
xxxx||||
xx......
  ^ first journal block here

I immediately suspected hardware corruption due to poorly implemented wear-leveling that was wearing out specific physical flash blocks. Here’s why.

The underlying unit of I/O is 512 bytes at the operating system level and 4096 bytes at the file system level, each aligned to a multiple of its own size. Corruption caused by software would therefore most likely show up as 512 byte runs aligned to 512 byte boundaries, or 4096 byte runs aligned to 4096 byte boundaries, not 2048 byte runs aligned to a 2048 byte boundary plus 512 bytes.
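
To make that signature concrete, here is a small classifier for a run of corruption using only the alignment arithmetic above; the function name, labels, and example offsets are made up for this sketch.

/*
 * Rough classifier for a run of corrupt data, based on the alignment
 * signatures discussed above.  Offsets and lengths are in bytes from
 * the start of the device.
 */

#include <stdio.h>

static const char *
classify_run(unsigned long offset, unsigned long length)
{
	/*
	 * Observed hardware signature: 2048 byte runs starting 512
	 * bytes past a 2048 byte boundary.  Checked first, since such
	 * a run is also 512 byte aligned.
	 */
	if (length == 2048 && offset % 2048 == 512)
		return "matches the CF wear-out pattern";
	/* Typical software signatures: whole I/O units aligned to
	 * their own size. */
	if (offset % 4096 == 0 && length % 4096 == 0)
		return "file-system-sized (4096 byte) I/O";
	if (offset % 512 == 0 && length % 512 == 0)
		return "OS-sized (512 byte) I/O";
	return "unclassified";
}

int
main(void)
{
	/* A hypothetical run 2048 bytes long, starting 512 bytes past
	 * a 2048 byte boundary. */
	printf("%s\n", classify_run(2048 + 512, 2048));
	return 0;
}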

The data content of the corruption didn’t match software corruption either. Software corruption usually contains the contents of a random location in memory, a slightly corrupted version of what should have been written to disk, all 0’s from writing a null-initialized part of memory, or a poisoning pattern from a freed part of memory. Instead, the corrupted blocks were all or mostly 1’s.

However, the pattern was a perfect match for worn-out CF blocks, presented via wear-leveling firmware. The job of the firmware can be summarized as remapping the operating system’s read/write requests for a logical block address onto different underlying physical blocks of hardware. The operating system issues writes as small as 512 byte blocks, while the underlying physical blocks are larger.

Writing to one logical 512 byte block requires a rewrite of the entire underlying physical block because of how flash hardware works. To write a single logical 512 byte block, the flash hardware must take the following steps (sketched in code after the list):

  1. Choose a new physical block as the write target.
  2. Erase the new physical block, which is usually larger than the logical 512 byte block.
  3. Copy over the original physical block to the new physical block, except with the new contents of the logical 512 byte write.
  4. Update the device’s internal mapping of logical to physical blocks.
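
Here is a compressed sketch of those steps. The 2048 byte physical block size and the data structures are assumptions for illustration; step 1, choosing the new physical block, is left to the caller because that choice is exactly the wear-leveling policy.

/*
 * Toy model of a firmware write path for one logical 512 byte block.
 * Erased flash reads back as all 1's.
 */

#include <string.h>

#define LOGICAL_BLOCK	512
#define PHYS_BLOCK	2048	/* assumed physical block size */
#define SLOTS_PER_PHYS	(PHYS_BLOCK / LOGICAL_BLOCK)
#define NUM_PHYS	1024

struct flash {
	unsigned char data[NUM_PHYS][PHYS_BLOCK];
	int map[NUM_PHYS];		/* logical group -> physical block */
	int erase_count[NUM_PHYS];	/* wear accumulates here */
};

static void
erase(struct flash *f, int phys)
{
	memset(f->data[phys], 0xff, PHYS_BLOCK);	/* erased flash is all 1's */
	f->erase_count[phys]++;
}

static void
write_logical(struct flash *f, int lba, const unsigned char *buf, int new_phys)
{
	int group = lba / SLOTS_PER_PHYS;
	int slot = lba % SLOTS_PER_PHYS;
	int old_phys = f->map[group];

	/* Step 1, choosing new_phys, happens in the caller. */
	erase(f, new_phys);					/* step 2 */
	memcpy(f->data[new_phys], f->data[old_phys], PHYS_BLOCK); /* step 3 */
	memcpy(f->data[new_phys] + slot * LOGICAL_BLOCK, buf, LOGICAL_BLOCK);
	f->map[group] = new_phys;				/* step 4 */
}

int
main(void)
{
	static struct flash f;		/* static: the data array is 2 MB */
	unsigned char buf[LOGICAL_BLOCK] = { 0 };

	erase(&f, 0);			/* pretend physical block 0 starts erased */
	f.map[0] = 0;
	write_logical(&f, 0, buf, 1);	/* move logical block 0 to physical block 1 */
	return 0;
}

If the caller keeps choosing the new physical block from the same small pool for the same logical blocks, the erase counts for that pool climb quickly, which matches the reuse pattern suspected below.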

At the time (2008), wear-leveling firmware was often not very good. SSDs were just getting good enough that Apple was beginning to use them in its laptops with HFS+, but cheaper devices like CF often suffered data corruption relatively quickly.

After I sent this analysis, the manufacturer shared more information about how it implemented wear-leveling. In their implementation, when the flash cells wore out, the bits would tend towards 1. Their hardware required erasing either all the odd blocks or all the even blocks within a physical block. I knew from storage research that implementing a full any-to-any logical-to-physical mapping in the device was more expensive than this manufacturer could afford at this price point, so they were most likely reusing a smaller set of physical blocks over and over for writes to the same logical blocks. A gradual increase of bits flipped to 1’s, in blocks of 2048 bytes aligned to odd 512 byte boundaries, matched corruption at the hardware and firmware level perfectly.

One of the challenges with this data corruption problem was that all the corrupted disk blocks were either ignored by the file system, or pinned in kernel memory after boot and never reread from disk until the system rebooted. Therefore the data corruption went undetected until the first reboot after it occurred, at which point the system was unusable. I wrote several tools to detect the corruption while the system was running, one of which is below.

Using my analysis and tools, the CF manufacturer was able to reproduce the data corruption and accepted responsibility for the file system corruption.

/*
 * Check if the on-disk and buffer cache copies of a block are the
 * same.
 */

#define _GNU_SOURCE		/* for O_DIRECT in <fcntl.h> */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <string.h>
#include <malloc.h>

#ifndef O_DIRECT
#define O_DIRECT	00040000	/* direct disk access hint */
#endif

#define BSIZE 1024

static size_t start_block = 0;
static size_t end_block = 524;

static char *cmd;

static void
usage(void)
{
	fprintf(stderr, "Usage: %s [-s start_block] [-e end_block] <device>\n",
		cmd);
	exit(1);
}

static void
read_data(int fd, char *buf, size_t bytes)
{
	int bytes_read = 0;
	int remaining = bytes;
	int n;

	while ((n = read(fd, buf + bytes_read, remaining)) > 0) {
		bytes_read += n;
		remaining -= n;
	}
}

/* XXX Print to file and use od */

static void
dump_data(char *buf)
{
	int i;

	for(i = 0; i < BSIZE; i++) {
		printf("%02hhx ", buf[i]);
		if ((i % 16) == 15)
			printf("\n");
	}
}

int
main(int argc, char *argv[])
{
	char *cached_data;
	char *ondisk_data;
	char all_ones[BSIZE];
	int cached_fd;
	int ondisk_fd;
	char *dev;
	int corruption_found = 0;
	size_t i;
	int c;

	while ((c = getopt(argc, argv, "e:s:")) != -1) {
		switch (c) {
		case 'e':
			end_block = strtoull(optarg, NULL, 0);
			break;
		case 's':
			start_block = strtoull(optarg, NULL, 0);
			break;
		}
	}

	if (optind != argc - 1)
		usage();

	dev = argv[optind];
	cached_fd = open(dev, O_RDONLY);
	if (cached_fd < 0) {
		printf("Couldn't open dev %s\n", dev);
		exit(1);
	}

	ondisk_fd = open(dev, O_RDONLY | O_DIRECT);
	if (ondisk_fd < 0) {
		printf("Couldn't open dev %s\n", dev);
		exit(1);
	}

	if ((cached_data = memalign(BSIZE, BSIZE)) == NULL) {
		printf("Couldn't allocate aligned memory\n");
		exit(1);
	}
		
	if ((ondisk_data = memalign(BSIZE, BSIZE)) == NULL) {
		printf("Couldn't allocate aligned memory\n");
		exit(1);
	}
		
	memset(all_ones, 0xff, BSIZE);

	while (1) {
		for (i = start_block; i < end_block; i++) {

			lseek(cached_fd, i * BSIZE, SEEK_SET);
			lseek(ondisk_fd, i * BSIZE, SEEK_SET);

			/* Reduce chances of a race */
			sync();

			read_data(cached_fd, cached_data, BSIZE);
			read_data(ondisk_fd, ondisk_data, BSIZE);
			/*
			 * Still a race condition here, but only when
			 * a bit in a bitmap block which is set all to
			 * ones transitions.  Very annoying.
			 */
			if ((memcmp(cached_data, ondisk_data, BSIZE) != 0) &&
			    (memcmp(all_ones, ondisk_data, BSIZE) == 0))
			{
				printf("\nCorruption found!\n");
				printf("Cached data, offset %u\n", i * BSIZE);
				dump_data(cached_data);
				printf("\nOn-disk data, offset %u\n", i * BSIZE);
				dump_data(ondisk_data);
				/* Keep checking the rest of the blocks */
				corruption_found = 1;
			}
		}
		if (corruption_found)
			exit(2);
		/* Don't check excessively */
		sleep(1);
	}
	/* Not reached */
	return 0;
}