Resolving Zend MM bypass memory leaks during image resampling

The Baseline Observation

A routine review of Prometheus node_exporter metrics on a dedicated application node revealed an anomaly in memory utilization. The node_memory_Active_anon metric exhibited a steady, linear climb over a 72-hour period, inversely mirroring a decline in node_memory_MemAvailable. The system is a bare-metal server with dual AMD EPYC 7313P processors (32 physical cores), 128GB of ECC RAM, and operates on Debian 12 (kernel 6.1).

The workload consists of a single web application running PHP-FPM 8.2 and Nginx 1.24. The application utilizes the Lawnshaper - Gardening & Landscaping WordPress Theme. This specific theme is heavily image-centric, relying on high-resolution portfolios of landscaping projects. A background cron job executes hourly to process incoming raw image uploads, compress them, and generate multiple responsive image sizes required by the frontend grid layouts.

Standard top-level diagnostics using htop indicated that CPU utilization was nominal, peaking at 15% during the cron execution window. There was no swap activity (node_memory_SwapTotal equaled node_memory_SwapFree). However, the Resident Set Size (RSS) of individual PHP-FPM worker processes, which typically hovered around 45MB, was permanently expanding to 350MB - 400MB after processing a batch of images. The processes were not being terminated because the PHP memory_limit was set to 512MB, and the Zend Engine was not throwing Allowed memory size exhausted fatal errors.

RSS versus VSZ and Process Memory Mapping

To understand why the memory was not being released back to the operating system after the PHP script completed its execution, I isolated a single PHP-FPM worker process (PID 48291) that had recently completed the image processing routine.

I utilized the pmap utility to dump the memory map of the process, looking specifically for anonymous memory mappings that consume physical RAM.

pmap -x 48291 | sort -k 3 -n -r | head -n 20

The output was structured as follows:

Address           Kbytes     RSS   Dirty Mode  Mapping
00007f8b94000000   65536   65536   65536 rw---   [ anon ]
00007f8b98000000   65536   65536   65536 rw---   [ anon ]
00007f8b9c000000   65536   65536   65536 rw---   [ anon ]
00007f8ba0000000   65536   65536   65536 rw---   [ anon ]
00007f8ba4000000   65536   64102   64102 rw---   [ anon ]
00007f8ba8000000   65536   48201   48201 rw---   [ anon ]
000055c3b8a1c000   14200   12104   12104 rw---   [ anon ]
00007f8bac000000    8192    8192    8192 rw---   [ anon ]
00007f8bb0000000    8192    8192    8192 rw---   [ anon ]
...

The output revealed multiple 64MB (65536 Kbytes) contiguous memory blocks. Four of them were fully resident in physical memory (RSS equaled Kbytes) and two more were mostly resident. All were marked as dirty, meaning they had been modified and could not be reclaimed by the kernel without writing to swap (which was disabled).
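The same tally can be scripted rather than read by eye. The following is a minimal sketch, assuming the six-column `pmap -x` layout shown above; the function name `summarize_arena_candidates` is hypothetical and the sample rows are copied from the output in this section.

```python
def summarize_arena_candidates(pmap_text, block_kb=65536):
    """Return (count, total_rss_kb) for anonymous mappings of exactly block_kb."""
    count = 0
    total_rss_kb = 0
    for line in pmap_text.splitlines():
        fields = line.split()
        # Expected columns: Address Kbytes RSS Dirty Mode Mapping...
        if len(fields) < 6 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header, "..." separators, and total lines are skipped
        kbytes, rss = int(fields[1]), int(fields[2])
        mapping = " ".join(fields[5:])
        if kbytes == block_kb and mapping == "[ anon ]":
            count += 1
            total_rss_kb += rss
    return count, total_rss_kb


SAMPLE = """\
Address           Kbytes     RSS   Dirty Mode  Mapping
00007f8b94000000   65536   65536   65536 rw---   [ anon ]
00007f8b98000000   65536   65536   65536 rw---   [ anon ]
00007f8b9c000000   65536   65536   65536 rw---   [ anon ]
00007f8ba0000000   65536   65536   65536 rw---   [ anon ]
00007f8ba4000000   65536   64102   64102 rw---   [ anon ]
00007f8ba8000000   65536   48201   48201 rw---   [ anon ]
000055c3b8a1c000   14200   12104   12104 rw---   [ anon ]
"""

blocks, rss_kb = summarize_arena_candidates(SAMPLE)
print(f"{blocks} blocks of 64MB holding {rss_kb} KB resident")
```

On a live system the text would come from running `pmap -x <pid>`; counting only exact 64MB anonymous mappings is a heuristic, not proof that each one is a glibc arena.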

The presence of exactly 64MB memory blocks is the distinct signature of the GNU C Library (glibc) memory allocator (malloc). When a multi-threaded application, or an application utilizing libraries that spawn threads, allocates memory, glibc creates memory arenas to prevent lock contention between threads.

On 64-bit systems, glibc defines the maximum size of a single arena heap as 64MB. The default maximum number of arenas is calculated as 8 * number of cores. On this 32-core machine, glibc was permitted to create up to 256 arenas per process. While PHP-FPM workers are fundamentally single-threaded, the underlying image processing library (ImageMagick via the imagick PECL extension) utilizes OpenMP to spawn multiple threads for parallel pixel operations during complex filters like Lanczos resampling.
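The arena ceiling is simple arithmetic. The sketch below assumes the defaults described above (8 arenas per core on 64-bit, 2 per core on 32-bit, 64MB per arena heap); the function names are illustrative, and the address-space figure assumes each arena holds a single heap.

```python
HEAP_MAX_SIZE = 64 * 1024 * 1024  # maximum size of one arena heap on 64-bit glibc


def default_arena_max(cores, sizeof_long=8):
    """Default per-process arena limit when MALLOC_ARENA_MAX is not set."""
    per_core = 8 if sizeof_long == 8 else 2
    return cores * per_core


def reserved_ceiling_bytes(cores):
    """Address space reserved if every permitted arena held one full heap."""
    return default_arena_max(cores) * HEAP_MAX_SIZE


cores = 32
print(default_arena_max(cores))                 # arenas permitted per process
print(reserved_ceiling_bytes(cores) // 2**30)   # ceiling in GiB
```

For the 32-core node this gives 256 arenas and a 16 GiB reservation ceiling per process, far more than any single worker would reach in practice, but it shows why the default limit offers no protection here.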

When ImageMagick spawned threads, glibc created new 64MB arenas to service the malloc requests from those threads. When the threads terminated after the image was resized, the memory was free()'d by ImageMagick. However, glibc does not immediately return free()'d arena memory to the operating system via the munmap or madvise system calls. It retains the memory in the arena's free list to accelerate future allocations. This behavior is by design, but in the context of a long-running PHP-FPM worker, it manifests as a persistent memory leak.

Extracting the Core Dump

To prove that the 64MB arenas were filled with free chunks rather than actual leaked application data, I extracted a core dump of the active PHP-FPM worker for offline analysis.

gcore -o /tmp/php-fpm-worker.core 48291

This generated a 412MB core file. I then loaded the core file into the GNU Debugger (gdb) alongside the PHP binary.

gdb /usr/sbin/php-fpm8.2 /tmp/php-fpm-worker.core

Within gdb, I needed to inspect the internal structures of the glibc malloc implementation. The core data structure for an arena is malloc_state (often referred to as main_arena for the primary thread, or a dynamically allocated struct for thread arenas).

/* glibc/malloc/malloc.c */
struct malloc_state
{
  /* Serialize access.  */
  __libc_lock_define (, mutex);

  /* Flags (formerly in max_fast).  */
  int flags;

  /* Fastbins */
  mfastbinptr fastbinsY[NFASTBINS];

  /* Base of the topmost chunk -- not otherwise kept in a bin */
  mchunkptr top;

  /* The remainder from the most recent split of a small request */
  mchunkptr last_remainder;

  /* Normal bins packed as described above */
  mchunkptr bins[NBINS * 2 - 2];

  /* Bitmap of bins */
  unsigned int binmap[BINMAPSIZE];

  /* Linked list */
  struct malloc_state *next;

  /* Memory allocated from the system in this arena.  */
  INTERNAL_SIZE_T system_mem;
  INTERNAL_SIZE_T max_system_mem;
};

I located the main_arena symbol with gdb's info variables command and then printed its state.

(gdb) p main_arena
$1 = {
  mutex = 0,
  flags = 1,
  fastbinsY = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  top = 0x55c3b8ec1010,
  last_remainder = 0x55c3b8a1c240,
  bins = {0x55c3b8a1c240, 0x55c3b8a1c240, 0x7f8b94000800, 0x7f8b94000800, ...},
  binmap = {14, 0, 0, 0},
  next = 0x7f8b94000020,
  system_mem = 14200832,
  max_system_mem = 14200832
}

The system_mem of the main_arena was approximately 14MB. This arena handles standard PHP execution and Zend Engine internal allocations that fall outside the Zend Memory Manager (ZMM).

The critical field is the next pointer (0x7f8b94000020), which links to the next arena created for the OpenMP threads. I cast this address to a malloc_state struct to inspect the thread arena.

(gdb) p *(struct malloc_state *)0x7f8b94000020
$2 = {
  mutex = 0,
  flags = 0,
  fastbinsY = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  top = 0x7f8b97ff0000,
  last_remainder = 0x7f8b94102400,
  bins = {0x7f8b94102400, 0x7f8b94102400, 0x7f8b94204800, 0x7f8b94204800, ...},
  binmap = {31, 0, 0, 0},
  next = 0x7f8b98000020,
  system_mem = 67108864,
  max_system_mem = 67108864
}

The system_mem for this thread arena was exactly 67108864 bytes (64MB). This confirmed the pmap output.

Walking the Heap for Fragmentation

To verify that this 64MB was not holding active data, I needed to inspect the bins. In glibc, chunks of free memory are stored in linked lists called bins. The bins array in the malloc_state struct contains pointers to these free chunks.

The first pair of entries in the bins array (bins[0] and bins[1]) holds the forward and backward pointers of the "unsorted bin", which collects recently freed chunks before they are sorted into specific size-based bins.

(gdb) p ((struct malloc_state *)0x7f8b94000020)->bins[0]
$3 = (mchunkptr) 0x7f8b94102400

I cast this address to a malloc_chunk struct.

/* glibc/malloc/malloc.c */
struct malloc_chunk {
  INTERNAL_SIZE_T      mchunk_prev_size;  /* Size of previous chunk (if free).  */
  INTERNAL_SIZE_T      mchunk_size;       /* Size in bytes, including overhead. */

  struct malloc_chunk* fd;                /* double links -- used only if free. */
  struct malloc_chunk* bk;

  /* Only used for large blocks: pointer to next larger size.  */
  struct malloc_chunk* fd_nextsize; /* double links -- used only if free. */
  struct malloc_chunk* bk_nextsize;
};

(gdb) p *(struct malloc_chunk *)0x7f8b94102400
$4 = {
  mchunk_prev_size = 0,
  mchunk_size = 4194305,
  fd = 0x7f8b94502400,
  bk = 0x7f8b94000068,
  fd_nextsize = 0x0,
  bk_nextsize = 0x0
}

The chunk at 0x7f8b94102400 had a mchunk_size of 4194305. In glibc, the three lowest bits of the size field are flags (PREV_INUSE, IS_MMAPPED, and NON_MAIN_ARENA); here only the lowest bit, PREV_INUSE, is set. Masking the flag bits off (4194305 & ~0x7) gives a chunk size of 4,194,304 bytes (4MB).
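The decoding step can be checked with a few lines of Python. This is a small sketch: the flag values match glibc's malloc.c, which reserves the three low bits of the size field, while the helper name `decode_chunk_size` is made up for illustration.

```python
PREV_INUSE = 0x1      # previous adjacent chunk is in use
IS_MMAPPED = 0x2      # chunk was obtained directly via mmap
NON_MAIN_ARENA = 0x4  # chunk belongs to a thread arena
SIZE_BITS = PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA


def decode_chunk_size(mchunk_size):
    """Split a raw mchunk_size value into the real size and its flag bits."""
    return {
        "size": mchunk_size & ~SIZE_BITS,
        "prev_inuse": bool(mchunk_size & PREV_INUSE),
        "is_mmapped": bool(mchunk_size & IS_MMAPPED),
        "non_main_arena": bool(mchunk_size & NON_MAIN_ARENA),
    }


print(decode_chunk_size(4194305))
```

For the value printed by gdb, this yields a size of 4194304 with only prev_inuse set.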

Following the forward pointer (fd = 0x7f8b94502400):

(gdb) p *(struct malloc_chunk *)0x7f8b94502400
$5 = {
  mchunk_prev_size = 4194304,
  mchunk_size = 4194305,
  fd = 0x7f8b94902400,
  bk = 0x7f8b94102400,
  fd_nextsize = 0x0,
  bk_nextsize = 0x0
}

This was another 4MB free chunk. I walked the linked list programmatically using a gdb python script. The 64MB arena was entirely composed of 4MB and 8MB free chunks. The OpenMP threads had allocated memory for pixel row processing during the image resize, freed it, and glibc held it in the arena.
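The script itself is not reproduced here. The following is a hypothetical reconstruction of such a walk, not the original: `walk_bin` follows fd pointers until the list returns to the bin header, and `gdb_read_chunk` is an adapter that only works inside a gdb session with glibc debug symbols available (so that struct malloc_chunk is known).

```python
def walk_bin(first_chunk, bin_head, read_chunk, limit=100000):
    """Follow fd pointers from first_chunk until the list returns to bin_head.

    read_chunk(addr) must return a (mchunk_size, fd) tuple of integers.
    Returns the list of chunk sizes with the three flag bits masked off.
    """
    sizes = []
    seen = set()
    addr = first_chunk
    while addr != bin_head and addr not in seen and len(sizes) < limit:
        seen.add(addr)  # guards against a corrupted, self-referencing list
        mchunk_size, fd = read_chunk(addr)
        sizes.append(mchunk_size & ~0x7)
        addr = fd
    return sizes


def gdb_read_chunk(addr):
    """Reader for use inside gdb; assumes glibc debug symbols are loaded."""
    import gdb  # only importable from within a gdb session
    chunk = gdb.parse_and_eval("*(struct malloc_chunk *)%#x" % addr)
    return int(chunk["mchunk_size"]), int(chunk["fd"])
```

Inside gdb (for example after `source walk.py`, a hypothetical file name), a call such as `walk_bin(0x7f8b94102400, 0x7f8b94000068, gdb_read_chunk)` would return the sizes in the unsorted bin of the thread arena inspected above, using the bk value of the first chunk as the bin header address.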

The ImageMagick Pixel Cache Allocation

To understand why the application was allocating data in 4MB to 8MB contiguous blocks, I examined the underlying logic. The background cron job triggers the WordPress wp_generate_attachment_metadata() function.

Unlike a generic WooCommerce theme, which typically relies on the standard predefined image sizes (Thumbnail, Medium, Large), the landscaping theme registers fifteen distinct custom image sizes to accommodate various masonry grid layouts, retina displays, and full-width parallax backgrounds.

// wp-content/themes/lawnshaper/inc/setup.php
add_image_size( 'lawnshaper-grid-small', 400, 300, true );
add_image_size( 'lawnshaper-grid-medium', 800, 600, true );
add_image_size( 'lawnshaper-grid-large', 1200, 900, true );
add_image_size( 'lawnshaper-hero', 1920, 1080, true );
add_image_size( 'lawnshaper-hero-retina', 3840, 2160, true );
// ... 10 more definitions ...

When a 4K raw image (e.g., a 20MB JPEG) is uploaded, the PHP imagick extension loads the entire uncompressed image into memory. A 3840x2160 image at 4 channels (RGBA) and 16 bits per channel requires approximately 66MB of raw memory for the pixel cache.
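The pixel cache estimate follows directly from the image geometry. Here is a minimal sketch of that arithmetic, assuming an uncompressed cache with no per-row padding or metadata; the function name is illustrative.

```python
def pixel_cache_bytes(width, height, channels=4, bits_per_channel=16):
    """Raw size of an uncompressed pixel cache for the given geometry."""
    return width * height * channels * (bits_per_channel // 8)


source = pixel_cache_bytes(3840, 2160)
print(source, round(source / 1e6, 1))  # bytes, then decimal megabytes

# Output caches for a few of the registered sizes shown above.
for name, (w, h) in {
    "lawnshaper-grid-small": (400, 300),
    "lawnshaper-hero": (1920, 1080),
    "lawnshaper-hero-retina": (3840, 2160),
}.items():
    print(name, pixel_cache_bytes(w, h))
```

The 3840x2160 case works out to 66,355,200 bytes, which is the "approximately 66MB" figure quoted above.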

ImageMagick's internal C code (MagickCore/cache.c) manages this allocation.

/* MagickCore/cache.c */
static MagickBooleanType SetPixelCacheExtent(CacheInfo *cache_info,
  const MagickSizeType length)
{
  if (cache_info->mapped == MagickFalse)
    {
      cache_info->pixels=(PixelPacket *) AcquireMagickMemory(length);
      if (cache_info->pixels == (PixelPacket *) NULL)
        return(MagickFalse);
      /* ... */
    }
  return(MagickTrue);
}

AcquireMagickMemory ultimately calls malloc. Because 66MB exceeds the mmap threshold (128KB by default, adjusted dynamically up to a maximum of 32MB on 64-bit systems), glibc services this allocation with a dedicated anonymous mapping obtained directly from the kernel via the mmap system call, completely bypassing the thread arenas.

strace -e mmap,munmap -p 48291

mmap(NULL, 69206016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8bc0000000
... image processing ...
munmap(0x7f8bc0000000, 69206016) = 0

The 66MB pixel cache is successfully mapped and unmapped. The memory leak (arena fragmentation) does not occur here.

OpenMP and The Resampling Filter

The fragmentation occurs during the resizing operation itself. ImageMagick utilizes the Lanczos filter for high-quality downsampling.

$image->resizeImage($width, $height, Imagick::FILTER_LANCZOS, 1);

The Lanczos algorithm requires computing a weighted average of surrounding pixels. This is computationally expensive, so ImageMagick parallelizes the operation across rows using OpenMP.

/* MagickCore/resize.c */
#if defined(MAGICKCORE_OPENMP_SUPPORT)
  #pragma omp parallel for schedule(static) shared(status) \
    magick_number_threads(image,image,image->rows,2)
#endif
for (y=0; y < (ssize_t) image->rows; y++)
{
  /* Allocate temporary buffers for row processing */
  ContributionInfo *contribution = (ContributionInfo *) AcquireQuantumMemory(
      MAX(image->columns, resize_image->columns), sizeof(*contribution));
  PixelPacket *source_row = QueueAuthenticPixels(image, 0, y, image->columns, 1, exception);
  PixelPacket *dest_row = QueueAuthenticPixels(resize_image, 0, y, resize_image->columns, 1, exception);
  
  /* ... apply filter math ... */

  RelinquishMagickMemory(contribution);
}

The #pragma omp parallel directive spawns threads based on the available CPU cores. On the 32-core EPYC processor, ImageMagick spawned 32 threads. Each thread executes the loop iterations assigned to it.

Inside the loop, AcquireQuantumMemory calls malloc to allocate the ContributionInfo buffer and the row buffers. Because these buffers are relative to the image width (e.g., 3840 pixels * 8 bytes = 30KB), they fall below the mmap threshold.

Therefore, when the 32 threads call malloc(30720), the glibc allocator detects concurrent allocations from different threads. To prevent mutex lock contention on the main_arena, it immediately creates a new 64MB memory arena for each thread.

32 threads * 64MB = 2,048MB (2GB) of virtual address space reserved for the arenas.
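The two sizes in play land on opposite sides of the mmap threshold, which is why only the small row buffers end up in arenas. The sketch below assumes the default 128KB threshold and ignores glibc's dynamic adjustment; the helper names are illustrative.

```python
DEFAULT_MMAP_THRESHOLD = 128 * 1024   # glibc default, before dynamic adjustment
ARENA_HEAP_BYTES = 64 * 1024 * 1024   # address space reserved per thread arena heap


def served_by_mmap(nbytes, threshold=DEFAULT_MMAP_THRESHOLD):
    """True if a request of nbytes is large enough to bypass the arenas."""
    return nbytes >= threshold


def arena_reservation(threads, heap_bytes=ARENA_HEAP_BYTES):
    """Virtual address space reserved when each thread gets its own arena."""
    return threads * heap_bytes


row_buffer = 3840 * 8              # per-row buffer from the text: 30720 bytes
pixel_cache = 3840 * 2160 * 4 * 2  # full 16-bit RGBA cache

print(served_by_mmap(row_buffer))       # False: allocated from a thread arena
print(served_by_mmap(pixel_cache))      # True: dedicated mmap, unmapped on free
print(arena_reservation(32) // 2**20)   # MB reserved across 32 thread arenas
```

The reservation is virtual address space; physical pages are only committed as the threads touch them, which is why the resident figure lags behind the reserved one.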

The threads rapidly allocate and free these 30KB chunks thousands of times as they process the image rows. When the resize operation completes, the threads are parked by the OpenMP runtime (they are not destroyed, to avoid thread creation overhead on the next resize).

The 30KB chunks are returned to the thread arenas. glibc coalesces these free chunks into larger blocks (like the 4MB blocks observed in gdb). However, because the threads are parked and the arenas are bound to them, glibc does not release the 64MB arenas back to the kernel.

When the PHP script proceeds to the next of the fifteen custom image sizes, the parked threads are woken up, and they reuse the existing arenas. This prevents further expansion for that single image.

The catastrophic drift occurs because PHP-FPM processes handle multiple consecutive requests. If Request A resizes an image, the FPM worker binds 2GB of arenas. When Request A terminates, the Zend Engine tears down its internal variables.

The Zend Engine Boundary Failure

PHP memory management is handled by the Zend Memory Manager (ZMM). When a PHP array or object is instantiated, the ZMM allocates memory from its own pre-allocated chunks (emalloc). When the script ends, the ZMM performs a wholesale release of all memory it manages.

/* Zend/zend_alloc.c */
ZEND_API void zend_memory_manager_shutdown(void)
{
    /* ... Free all ZMM controlled chunks ... */
}

The ZMM strictly accounts for these allocations against the memory_limit directive defined in php.ini.

However, the ZMM has absolutely no visibility or control over memory allocated directly by underlying C libraries (like ImageMagick or standard glibc threads). The malloc calls made by OpenMP bypass the ZMM entirely.

Therefore, the PHP script completes successfully, reporting a peak memory usage of perhaps 12MB. The FPM worker resets its state and waits for the next FastCGI request.
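The gap between what PHP reports and what the kernel sees can be measured per worker. This is a hedged sketch: it parses the VmRSS line from the text of /proc/<pid>/status and subtracts a peak value that is assumed to come from PHP's memory_get_peak_usage(); both helper names are hypothetical.

```python
import re


def vm_rss_kb(status_text):
    """Extract VmRSS in kB from the contents of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def unaccounted_kb(status_text, php_peak_bytes):
    """Resident memory the Zend Memory Manager never saw, in kB."""
    rss = vm_rss_kb(status_text)
    if rss is None:
        return None
    return rss - php_peak_bytes // 1024


example = "Name:\tphp-fpm8.2\nVmRSS:\t  409600 kB\nVmSwap:\t       0 kB\n"
print(unaccounted_kb(example, 12 * 1024 * 1024))
```

With a live worker the text would be read from `/proc/<pid>/status`. The difference also includes the PHP binary, shared libraries, and opcache mappings, so it is an upper bound on allocator retention rather than an exact figure.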

The OpenMP threads spawned by ImageMagick are bound to the host process (the FPM worker). When the worker accepts Request B (perhaps another image upload), it might spawn OpenMP threads again. Depending on the CPU scheduler and OpenMP runtime state, it might reuse the parked threads, or it might spawn new ones, leading glibc to allocate additional arenas.

Over 72 hours, with 50 FPM workers processing background cron tasks, the accumulation of these fragmented 64MB thread arenas drove the Active_anon memory to system limits, triggering kernel page reclaims and degrading performance without ever hitting a PHP fatal error.

TLB Misses and Cache Line Contention

Beyond the sheer volume of RAM consumed, holding gigabytes of fragmented memory across 50 processes induces severe microarchitectural penalties on the AMD EPYC processor.

To quantify this, I profiled the FPM worker using Linux perf, specifically looking at the Translation Lookaside Buffer (TLB) metrics. The TLB is a hardware cache within the MMU that stores recent translations of virtual memory to physical memory addresses.

perf stat -e dTLB-loads,dTLB-load-misses -p 48291 sleep 10

 Performance counter stats for process id '48291':

     1,402,102,401      dTLB-loads                                                  
       182,410,201      dTLB-load-misses          #   13.01% of all dTLB cache hits 

      10.001402102 seconds time elapsed

A 13% dTLB miss rate is exceptionally high.

The CPU manages memory in 4KB pages. A 64MB arena consists of 16,384 distinct 4KB pages. When ImageMagick threads allocate memory, they receive pointers scattered randomly across these 16,384 pages depending on which chunk was pulled from the free list.

When the CPU executes the Lanczos math, it fetches data from memory. If the virtual address is not in the TLB, the CPU must halt execution and perform a "page walk" – traversing the page tables in main memory to find the physical address. A page walk requires up to four serial memory access operations, taking hundreds of CPU cycles.

The fragmentation caused by glibc servicing thousands of 30KB allocations across dozens of 64MB arenas meant that contiguous pixel data in the application's logical view was physically scattered across thousands of disparate 4KB pages. The TLB, which can only hold a few thousand entries (e.g., 2048 entries in the EPYC L2 TLB), was constantly thrashing, evicting entries to make room for new ones as the threads jumped across the fragmented arenas.
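The numbers in this section can be reproduced with a short calculation. The sketch assumes 4KB pages and the 2048-entry L2 TLB figure quoted above; the function names are illustrative.

```python
PAGE_BYTES = 4096


def miss_rate(loads, misses):
    """Fraction of dTLB loads that missed."""
    return misses / loads


def tlb_reach_bytes(entries, page_bytes=PAGE_BYTES):
    """Amount of memory the TLB can map at once."""
    return entries * page_bytes


def pages_in(region_bytes, page_bytes=PAGE_BYTES):
    """Number of pages needed to back a region."""
    return region_bytes // page_bytes


arena = 64 * 1024 * 1024
print(f"{miss_rate(1_402_102_401, 182_410_201):.2%}")  # from the perf output
print(tlb_reach_bytes(2048) // 2**20, "MiB of TLB reach")
print(pages_in(arena), "pages per arena")
```

A 2048-entry TLB covers 8 MiB with 4KB pages, one eighth of a single 64MB arena, so accesses scattered across dozens of arenas cannot stay resident in the TLB.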

This TLB thrashing, combined with the CPU cycles wasted polling the main_arena mutex when thread arenas were initially established, accounted for the elevated %sy and degraded execution times observed during the cron window.

Glibc Tunables (mallopt) vs Alternative Allocators

The standard method to restrict glibc arena sprawl is utilizing the MALLOC_ARENA_MAX environment variable. By setting this variable, you force glibc to limit the total number of arenas it creates per process.

export MALLOC_ARENA_MAX=2

If a process spawns 32 threads, but MALLOC_ARENA_MAX is 2, glibc will only create 2 arenas. The 32 threads will be forced to share these 2 arenas.

While this resolves the memory bloat (VSZ drops dramatically, and Active_anon stabilizes), it introduces a severe performance penalty. When 32 threads attempt to allocate memory concurrently from only 2 arenas, at most 2 of them can hold an arena lock at any instant. The remaining threads (up to 30) hit a locked mutex and must block in kernel space (futex wait) until the lock is released.

perf record -e syscalls:sys_enter_futex -g -p 48291
perf script

php-fpm 48291[012]  120.401000: syscalls:sys_enter_futex: uaddr: 0x7f8b94000000, op: 128, val: 0
    ffffffff8120a100 __x64_sys_futex+0x100 (/lib/modules/6.1.0/build/vmlinux)
    00007f8bc2a1c402 __lll_lock_wait_private+0x22 (/usr/lib/x86_64-linux-gnu/libc.so.6)
    00007f8bc2a20101 _int_malloc+0x201 (/usr/lib/x86_64-linux-gnu/libc.so.6)
    00007f8bc2a22005 malloc+0x105 (/usr/lib/x86_64-linux-gnu/libc.so.6)
    00007f8bbf100201 AcquireMagickMemory+0x11 (/usr/lib/x86_64-linux-gnu/libMagickCore-6.Q16.so.7)

The trace confirms that limiting arenas causes _int_malloc to stall on __lll_lock_wait_private. The threads spend more time waiting for memory allocation locks than executing image math.

The glibc allocator is fundamentally unsuited for workloads that combine long-running processes (FPM) with bursty, highly concurrent, small-object allocations from C extensions.

Injecting jemalloc via LD_PRELOAD

A more robust resolution is to replace the glibc allocator with an allocator designed specifically for multi-threaded concurrency and strict fragmentation control. jemalloc (originally developed for FreeBSD and utilized by systems like Redis and Varnish) implements a different architectural model based on size classes and thread-specific caching (tcache).

Instead of creating monolithic 64MB arenas, jemalloc assigns a dedicated cache to each thread (tcache). When a thread requests 30KB, it is serviced directly from the tcache without acquiring any locks. The tcache is populated by fetching blocks of memory from centralized arenas in larger chunks.

More importantly, jemalloc aggressively utilizes the madvise(..., MADV_DONTNEED) system call during its garbage collection cycles. When a thread frees memory, and a chunk becomes entirely empty, jemalloc informs the kernel that the physical memory pages backing that virtual address are no longer needed. The kernel instantly reclaims the physical RAM (reducing Active_anon), but the virtual memory mapping remains intact. When the thread accesses that memory again, the kernel transparently assigns a new zeroed physical page via a minor page fault.

This behavior provides the balance required: minimal lock contention during OpenMP processing, and prompt release of physical memory back to the OS once the ImageMagick function returns control to the Zend Engine.

To replace the allocator without recompiling PHP or the operating system, the LD_PRELOAD environment variable is used. The dynamic linker loads the preloaded library ahead of glibc, so calls to malloc, calloc, realloc, and free resolve to jemalloc's implementations at runtime.

I installed the jemalloc library from the Debian repositories.

apt-get install libjemalloc2

The path to the shared object is /usr/lib/x86_64-linux-gnu/libjemalloc.so.2.
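Once a process is started with the preload in place, it is worth confirming that the library was actually mapped. The following sketch scans the text of /proc/<pid>/maps for the shared object; the function name `preloaded_libs` is hypothetical and the sample lines are illustrative, not captured output.

```python
def preloaded_libs(maps_text, needle="libjemalloc"):
    """Return the distinct mapped file paths containing needle."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


sample_maps = (
    "7f8bc3000000-7f8bc3008000 r--p 00000000 08:01 131  "
    "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2\n"
    "7f8bc3008000-7f8bc3090000 r-xp 00008000 08:01 131  "
    "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2\n"
    "7f8bc2a00000-7f8bc2a28000 r--p 00000000 08:01 977  "
    "/usr/lib/x86_64-linux-gnu/libc.so.6\n"
    "7f8b94000000-7f8b98000000 rw-p 00000000 00:00 0 \n"
)
print(preloaded_libs(sample_maps))
```

For a running worker, the input would be the contents of `/proc/<pid>/maps`; an empty result means the preload did not take effect for that process.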

Verification of jemalloc Internal State

Before deploying to production, I verified jemalloc's handling of the exact workload. jemalloc includes internal profiling capabilities controlled via the MALLOC_CONF environment variable.

I executed a test script through the PHP CLI, instructing jemalloc to print its internal statistics upon exit. jemalloc writes this report to stderr by default, so the redirect captures file descriptor 2.

MALLOC_CONF="stats_print:true" LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 php resize_test.php 2> jemalloc_stats.txt

Reviewing the jemalloc_stats.txt output confirmed the architectural shift.

Arenas: 4
Quantum size: 16
Page size: 4096
Maximum thread-cached size class: 32768

Allocated: 14201024, active: 16102400, mapped: 48201000, retained: 21010000
...
tcache_bytes: 410200
madvise_dontneed: 84201
...

The output indicated that jemalloc created only 4 arenas (compared to glibc's ceiling of 256). Active memory stayed within roughly 13% of allocated memory (16.1MB versus 14.2MB), indicating low internal fragmentation. Crucially, the madvise_dontneed counter recorded 84,201 invocations. jemalloc was actively issuing the system calls required to hand the freed ImageMagick buffers back to the kernel, eliminating the linear drift of Active_anon.
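The ratio between active and allocated bytes can be pulled out of the summary line mechanically. This is a small sketch that parses the "name: value" pairs from the line shown above; the helper names are illustrative, and the fields present in real jemalloc output vary by version.

```python
import re


def parse_totals(line):
    """Turn 'Allocated: 1, active: 2, ...' into a dict of integers."""
    return {name.lower(): int(value)
            for name, value in re.findall(r"(\w+):\s*(\d+)", line)}


def overhead_ratio(totals):
    """Active bytes per allocated byte; values near 1.0 mean little waste."""
    return totals["active"] / totals["allocated"]


summary = "Allocated: 14201024, active: 16102400, mapped: 48201000, retained: 21010000"
totals = parse_totals(summary)
print(totals)
print(round(overhead_ratio(totals), 3))
```

Tracking this ratio across repeated test runs is one way to spot fragmentation regressions before they show up in node-level metrics.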

Resolution Configuration

Apply the following override to the PHP-FPM systemd service to inject the jemalloc library into the worker process initialization sequence.

# /etc/systemd/system/php8.2-fpm.service.d/override.conf
[Service]
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

For the override to take effect, reload the systemd unit files (systemctl daemon-reload) and restart the service (systemctl restart php8.2-fpm) so that the master process and its workers start with the preloaded allocator.
