Mysterious kernel lock contention

Summary

A highly multi-threaded web application suffered major performance degradation when tested on a new platform with higher performing hardware and a faster operating system. I traced the problem to lock contention on a process-wide kernel memory management data structure. I fixed the performance by manually tuning a parameter in the application’s memory allocator and later helped implement an automatic way to tune the parameter.

Details

A client was evaluating several new hardware/software platforms for use in a large datacenter, running their highly multi-threaded ecommerce application. Their current platform was a rather aged Solaris/SPARC; they were testing much faster new platforms using Linux on Intel and PowerPC.

We all expected the performance to improve a lot on the new platforms, and on most of the platforms, it did. However, the performance of their application was spectacularly bad on one of the Linux-based platforms, despite the much faster hardware and Linux’s overall performance advantage for threaded applications.

I used live kernel tracing and performance monitoring to discover that multiple application threads were waiting on the spinlock that protected a per-process kernel data structure that managed the process’s memory mappings. This forced other threads to block whenever they needed to change the process’s memory maps, which was surprisingly often.

The reason the kernel was holding the lock for so long is that it was writing zeros to a large area of user memory before adding it to the process’s memory map, while holding the lock protecting the relevant data structure. This was because the process had allocated a buffer using mmap(), and for security the kernel had to clear the data before returning the buffer to the user. Otherwise, it might leak data from the kernel or another process.

Clearing this memory was entirely wasted effort. The application repeatedly allocated and freed the same size memory buffer, overwriting the entire buffer every time it allocated it. It did not need to be initialized to zeros at any point, much less while holding the lock protecting the process’s memory map structures. However, the kernel was forced to re-clear the buffer because for some reason the application repeatedly returned the memory to the kernel via munmap() and then requested the same amount of memory back with mmap().

Worse, only one thread per process could do this mmap() and clear at a time - and yet all the threads were doing it quite often, so the application spent a stunning amount of time busy waiting on this one spinlock, waiting for a thread to finish unnecessarily clearing memory so that the next thread could also unnecessarily clear memory. Why?

The application itself was calling malloc() and free() from libc rather than directly calling mmap() and munmap(), so I looked into the malloc() code. The libc malloc() code had a tuning parameter, the mmap threshold, that specified that every memory allocation larger than this parameter should be mmap()d and munmap()d every time it was allocated or freed. The default setting for this parameter had last been updated 5 years before and was now comically low.

I used mallopt() to manually set the mmap threshold to the size of the largest buffer the application allocated. The performance problems disappeared, since the application was no longer single threading while clearing megabytes of memory in the kernel.

A few months later, I collaborated with a colleague to write a very simple patch (below) to solve this problem in the general case. Our solution was to dynamically tune the mmap threshold based on the largest buffer size freed by the application. I also wrote a corresponding benchmark to demonstrate the performance problem which was later added to the Linux Test Project.

Like this story? Read more stories about solving systems problems.

Benchmark source code (with improvements from other users)

Patch to libc with long comment explaining memory mapping constraints removed:

 #endif
 
 /*
+  MMAP_THRESHOLD_MAX and _MIN are the bounds on the dynamically
+  adjusted MMAP_THRESHOLD.
+*/
+
+#ifndef DEFAULT_MMAP_THRESHOLD_MIN
+#define DEFAULT_MMAP_THRESHOLD_MIN (128 * 1024)
+#endif
+
+#ifndef DEFAULT_MMAP_THRESHOLD_MAX
+#define DEFAULT_MMAP_THRESHOLD_MAX (8 * 1024 * 1024 * sizeof(long))
+#endif
+
+/*
   M_MMAP_THRESHOLD is the request size threshold for using mmap()
   to service a request. Requests of at least this size that cannot
   be allocated using already-existing space will be serviced via mmap.
@@ -1443,12 +1456,63 @@ int      __posix_memalign(void **, size_
   "large" chunks, but the value of "large" varies across systems.  The
   default is an empirically derived value that works well in most
   systems.
 */
 
 #define M_MMAP_THRESHOLD      -3
 
 #ifndef DEFAULT_MMAP_THRESHOLD
-#define DEFAULT_MMAP_THRESHOLD (128 * 1024)
+#define DEFAULT_MMAP_THRESHOLD DEFAULT_MMAP_THRESHOLD_MIN
 #endif
 
 /*
@@ -2237,6 +2301,10 @@ struct malloc_par {
   int              n_mmaps;
   int              n_mmaps_max;
   int              max_n_mmaps;
+  /* the mmap_threshold is dynamic, until the user sets
+     it manually, at which point we need to disable any
+     dynamic behavior. */
+  int              no_dyn_threshold; 
 
   /* Cache malloc_getpagesize */
   unsigned int     pagesize;
@@ -3414,6 +3482,12 @@ public_fREe(Void_t* mem)
 #if HAVE_MMAP
   if (chunk_is_mmapped(p))                       /* release mmapped memory. */
   {
+    /* see if the dynamic brk/mmap threshold needs adjusting */
+    if (!mp_.no_dyn_threshold && (p->size > mp_.mmap_threshold) &&
+        (p->size <= DEFAULT_MMAP_THRESHOLD_MAX)) {
+      mp_.mmap_threshold = p->size; 
+      mp_.trim_threshold = 2 * mp_.mmap_threshold; 
+    }
     munmap_chunk(p);
     return;
   }
@@ -5404,10 +5478,12 @@ int mALLOPt(param_number, value) int par
 
   case M_TRIM_THRESHOLD:
     mp_.trim_threshold = value;
+    mp_.no_dyn_threshold = 1;
     break;
 
   case M_TOP_PAD:
     mp_.top_pad = value;
+    mp_.no_dyn_threshold = 1;
     break;
 
   case M_MMAP_THRESHOLD:
@@ -5418,6 +5494,7 @@ int mALLOPt(param_number, value) int par
     else
 #endif
       mp_.mmap_threshold = value;
+      mp_.no_dyn_threshold = 1;
     break;
 
   case M_MMAP_MAX:
@@ -5427,6 +5504,7 @@ int mALLOPt(param_number, value) int par
     else
 #endif
       mp_.n_mmaps_max = value;
+      mp_.no_dyn_threshold = 1;
     break;
 
   case M_CHECK_ACTION: