Manual SMP Affinity for High-PPS WordPress Deployments
Debugging Phantom I/O Wait and NIC Interrupt Storms
The environment is a KVM-based hypervisor running a Debian 12 guest. The stack consists of Nginx 1.24, PHP 8.2-FPM, and MariaDB 10.11. The guest is serving a production instance of the PoliticalWP - Political Campaign WordPress Theme. During a routine audit of the /proc/stat file, I noticed that iowait (wa) was fluctuating between 25% and 40%. Standard troubleshooting began with iostat -xz 1, but the NVMe backing stores showed 0% utilization and latencies below 0.1ms. The disks were not the bottleneck. The CPU cycles were being misreported as I/O wait due to the way the kernel handles specific hardware interrupts when the virtio_net driver is under sustained packet-per-second (PPS) pressure.
Investigation of the Interrupt Layer
The PoliticalWP theme, like many data-heavy campaign sites, relies on frequent polling for donation counters and live social feeds, which generates a high volume of small TCP packets. In /proc/interrupts, the virtio-net input and output queue lines (named along the lines of virtio0-input.0 and virtio0-output.0) showed an uneven distribution across the eight virtual cores: Core 0 was taking 90% of the IRQs, while cores 1 through 7 sat largely idle in terms of hardware interrupt handling. This "interrupt pinning" often causes the kernel to report iowait because the CPU is waiting for the interrupt handler to finish before it can proceed with scheduled tasks, even if the disk controller is idle.
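To put a number on that imbalance, the per-core counts can be summed straight from /proc/interrupts. The following is a minimal sketch, not the exact command used during the investigation; the function name irq_per_cpu and its default pattern are illustrative assumptions, and the pattern may need adjusting to match how the queues are named on a given guest.

```shell
# Sum interrupt counts per CPU for lines matching a pattern in a
# /proc/interrupts-style file, and print each CPU's share of the total.
# irq_per_cpu and its defaults are hypothetical names for this sketch.
irq_per_cpu() {
    local pattern="${1:-virtio.*-input}"
    local file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        NR == 1 { ncpu = NF; next }        # header row: CPU0 CPU1 ...
        $0 ~ pat {
            for (c = 0; c < ncpu; c++) {
                sum[c] += $(c + 2)         # field 1 is the "IRQ:" label
                total  += $(c + 2)
            }
        }
        END {
            for (c = 0; c < ncpu; c++)
                printf "CPU%d %d %.1f%%\n", c, sum[c], (total ? 100 * sum[c] / total : 0)
        }
    ' "$file"
}
```

Calling irq_per_cpu with no arguments reads the live /proc/interrupts. The counters are cumulative since boot, so comparing two runs a few seconds apart gives a better picture of the current distribution than a single reading.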
I checked the status of irqbalance. The daemon was running but failing to redistribute the load effectively. This is a known limitation in certain virtualized environments where the CPU topology reported to the guest does not accurately reflect the physical NUMA nodes. I disabled irqbalance immediately to prevent it from overriding manual configurations. The goal was to manually steer the network interrupts to specific cores to prevent Core 0 from becoming a localized bottleneck for the entire networking stack.
Network Buffer and Ring Analysis
I used ethtool -S eth0 to look for dropped packets at the interface level. The rx_no_dma_resources counter was incrementing by several hundred every minute, which indicates that the receive (RX) ring buffer was full and the kernel could not drain packets fast enough. With complex layouts such as those found in the Download WooCommerce Theme category, or multi-featured themes like PoliticalWP, the overhead of the TCP stack and the subsequent handoff to the web server's worker processes can be high.
I queried the ring sizes with ethtool -g eth0: the current setting was the default of 256 descriptors, against a maximum supported value of 1024. To provide more headroom for bursty campaign-driven traffic, I raised both the RX and TX rings to that maximum with ethtool -G. This change allowed the NIC to buffer more packets during the millisecond-scale intervals when the CPU was context-switching between PHP-FPM processes.
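The ring resize can be scripted so that it follows whatever maximum the driver reports instead of hard-coding 1024. This is a rough sketch under the assumption that ethtool -g prints its usual "Pre-set maximums" and "Current hardware settings" sections; the helper names ring_max and grow_rings are made up for illustration, and eth0 is simply the interface name used in this article.

```shell
# Read `ethtool -g` output on stdin and print the pre-set maximum
# RX and TX ring sizes as "rx tx". (ring_max is a hypothetical helper.)
ring_max() {
    awk '
        /^Pre-set maximums/          { section = "max"; next }
        /^Current hardware settings/ { section = "cur"; next }
        section == "max" && $1 == "RX:" { rx = $2 }
        section == "max" && $1 == "TX:" { tx = $2 }
        END { print rx, tx }
    '
}

# Grow both rings of an interface to the reported maximum.
# Needs root; the default interface name is an assumption.
grow_rings() {
    local iface="${1:-eth0}"
    set -- $(ethtool -g "$iface" | ring_max)
    ethtool -G "$iface" rx "$1" tx "$2"
}
```

Running grow_rings eth0 as root would apply the change. Ring sizes set this way do not survive a reboot, so the call would need to live in whatever boot-time hook the system already uses for network tuning.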
Manual SMP Affinity Mapping
To resolve the uneven interrupt distribution, I calculated the affinity masks for the NIC interrupts. Each interrupt in /proc/irq/ has an smp_affinity file that accepts a hexadecimal bitmask. For an 8-core system, a mask of 01 corresponds to Core 0, 02 to Core 1, 04 to Core 2, and so on.
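The mask arithmetic is just a one-bit shift, which a pair of small helpers can make explicit. This is a sketch with hypothetical function names (cpu_mask, mask_cpus); it covers CPUs 0 through 31 only, since larger systems write the mask as comma-separated 32-bit groups.

```shell
# Print the two-digit hex smp_affinity mask selecting a single CPU (0-based).
# Valid for CPUs 0-31 in this simple form.
cpu_mask() {
    printf '%02x\n' "$((1 << $1))"
}

# Decode a hex mask back into a comma-separated list of CPU numbers.
mask_cpus() {
    local mask=$((0x$1)) cpu=0 out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out,}$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}
```

For example, cpu_mask 4 yields the mask for Core 4, and mask_cpus 0f lists the cores covered by a mask that spans the first four CPUs.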
I identified the IRQ numbers for the virtio-net queues. There were eight queues in total, matching the CPU count. I mapped each queue to its corresponding core:
- IRQ 24 (queue 0) -> Core 0 (mask 01)
- IRQ 25 (queue 1) -> Core 1 (mask 02)
- IRQ 26 (queue 2) -> Core 2 (mask 04)
- IRQ 27 (queue 3) -> Core 3 (mask 08)
- IRQ 28 (queue 4) -> Core 4 (mask 10)
- IRQ 29 (queue 5) -> Core 5 (mask 20)
- IRQ 30 (queue 6) -> Core 6 (mask 40)
- IRQ 31 (queue 7) -> Core 7 (mask 80)
After applying these masks, the CPU utilization across all eight cores leveled out. The iowait reporting immediately dropped to 0.2%, confirming that the "wait" was indeed the CPU being choked by unserviced interrupts rather than slow block storage.
TCP Stack Tuning for Packet Density
The theme's modularity means a single page load can involve 50 to 80 internal requests if the object cache is not primed. This puts a heavy burden on the local loopback and the external network interface. I turned to sysctl to optimize how the kernel handles these connections. The net.ipv4.tcp_fin_timeout was reduced from 60 to 15 seconds to recycle sockets in the FIN-WAIT-2 state more aggressively.
I also increased net.core.netdev_max_backlog from 1000 to 5000. This parameter sets the maximum number of packets allowed to queue per CPU on the input side, after being received from the NIC but before being processed by the kernel's protocol stack. With the ring buffers now at 1024, the backlog needed to be deep enough to absorb a full drain of a queue without dropping frames.
Further, I set net.ipv4.tcp_slow_start_after_idle to 0. This is critical for websites that use persistent connections (Keep-Alive). By default, the TCP congestion window is reset after a period of inactivity. Disabling this behavior allows the connection to resume at its previous speed, which is beneficial for users navigating through multiple pages of a campaign site where they spend 1-2 minutes reading content before moving to the next section.
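Values set with sysctl -w are lost on reboot, so the three settings above are worth persisting. One way to do that, sketched here under the assumption of a standard /etc/sysctl.d layout, is to write them to a drop-in file; the file name 90-net-tuning.conf and the function name write_net_sysctl are hypothetical.

```shell
# Write the TCP/backlog settings discussed above to a sysctl drop-in file.
# The default path is an assumed, illustrative file name.
write_net_sysctl() {
    local target="${1:-/etc/sysctl.d/90-net-tuning.conf}"
    printf '%s\n' \
        '# Recycle orphaned FIN-WAIT-2 sockets sooner' \
        'net.ipv4.tcp_fin_timeout = 15' \
        '# Deeper input queue between the NIC and the protocol stack' \
        'net.core.netdev_max_backlog = 5000' \
        '# Keep the congestion window across idle keep-alive periods' \
        'net.ipv4.tcp_slow_start_after_idle = 0' > "$target"
}
```

After running write_net_sysctl as root, sysctl --system reloads all drop-in files so the values take effect without a reboot.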
Memory Management and Slab Cache
While investigating the interrupt issue, I observed the behavior of the dentry and inode caches. High PPS can sometimes trigger a high rate of object allocation in the kernel's slab allocator. Using slabtop -o, I watched the dentry cache. The PoliticalWP theme uses many small asset files. Each time Nginx serves a file, an inode and dentry lookup occurs.
To ensure the kernel prioritizes these caches over the filesystem page cache, I adjusted vm.vfs_cache_pressure to 50. The default of 100 often causes the kernel to reclaim dentry memory too aggressively, forcing repeated disk lookups for file metadata. Reducing this value keeps the metadata in RAM longer, reducing the total system call time for each request.
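For a scriptable alternative to watching slabtop, the footprint of a single cache can be estimated from /proc/slabinfo by multiplying the object count by the object size. This is only a sketch: slab_kib is a made-up helper name, the figure ignores per-slab overhead, and reading /proc/slabinfo normally requires root.

```shell
# Print the approximate size in KiB of one slab cache, computed as
# num_objs * objsize from a /proc/slabinfo-style file.
# slab_kib is a hypothetical helper; defaults are assumptions.
slab_kib() {
    local name="${1:-dentry}"
    local file="${2:-/proc/slabinfo}"
    # Columns: name active_objs num_objs objsize objperslab pagesperslab ...
    awk -v n="$name" '$1 == n { printf "%d\n", $3 * $4 / 1024 }' "$file"
}
```

Sampling slab_kib dentry before and after changing vm.vfs_cache_pressure, under comparable traffic, gives a rough indication of whether the kernel is actually holding on to more metadata.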
The interaction between the hardware interrupt distribution and the software's ability to process those interrupts is the single most overlooked aspect of WordPress performance tuning on virtualized infrastructure. When the CPU is reporting wa, do not assume the disk is at fault. Check the interrupt vectors first.
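# Manual IRQ steering script snippet (bash, run as root)
# Assumes a single multiqueue virtio-net device whose RX queues appear in
# /proc/interrupts as virtioN-input.0 through virtioN-input.7.
for i in {0..7}; do
    # IRQ number of RX queue i (first match only)
    irq=$(grep -E "virtio[0-9]+-input\.${i}\$" /proc/interrupts | head -n 1 | awk -F: '{print $1}' | tr -d ' ')
    if [ -n "$irq" ]; then
        # One bit per core: queue i is steered to core i
        mask=$(printf "%x" $((1 << i)))
        echo "$mask" > "/proc/irq/$irq/smp_affinity"
    fi
done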
# Kernel tuning
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
Verify the smp_affinity_list for human-readable confirmation. If the wa persists despite zero disk load, investigate your hypervisor's steal time (st) to rule out noisy neighbors at the physical layer.
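Both checks can be scripted. The sketch below uses two hypothetical helpers: show_affinity prints the CPU list for every IRQ under a /proc/irq-style directory, and cpu_wait_steal reports iowait and steal as a share of total CPU time from the aggregate cpu line in /proc/stat. Those /proc/stat figures are averages since boot, so they indicate a trend and do not replace a live view in top or vmstat.

```shell
# Print "IRQ n -> CPUs list" for each IRQ under a /proc/irq-style directory.
# (show_affinity is an illustrative name, not a standard tool.)
show_affinity() {
    local root="${1:-/proc/irq}" f irq
    for f in "$root"/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        irq=${f%/smp_affinity_list}
        printf 'IRQ %s -> CPUs %s\n' "${irq##*/}" "$(cat "$f")"
    done
}

# Report iowait and steal as percentages of total CPU time since boot.
# Fields on the aggregate "cpu" line: user nice system idle iowait irq
# softirq steal (guest time is already included in user).
cpu_wait_steal() {
    awk '$1 == "cpu" {
        total = 0
        for (f = 2; f <= 9; f++) total += $f
        printf "iowait %.1f%% steal %.1f%%\n", 100 * $6 / total, 100 * $9 / total
    }' "${1:-/proc/stat}"
}
```

With no arguments, both functions read the live /proc tree. A steal figure that stays high while iowait is near zero points at contention on the hypervisor, not at anything inside the guest.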