Skip to main content

Command Palette

Search for a command to run...

Tracing NFSv4 attribute cache invalidation loops

Updated
12 min read

The Baseline State

A regular review of Prometheus node metrics across a cluster of application servers highlighted a persistent anomaly on the tertiary worker node group. During specific background cron executions, the system CPU utilization (%sy) steadily climbed to 45%, while user space CPU utilization (%us) remained below 10%. I/O wait (%wa) was negligible, hovering around 1.2%. The hardware consists of standard compute instances with 8 vCPUs and 16GB of RAM running Debian 12 (kernel 6.1).

The application layer processes IT service tickets and generates automated SLA compliance reports in PDF format. This deployment utilizes the Solutek - Technology & IT Services WordPress Theme. The background task iterates through thousands of service attachments, verifying file integrity before compiling the final document. The storage backend for these attachments is an NFSv4 share mounted from a dedicated NetApp storage appliance over a 10Gbps private network.

Standard diagnostics using top and iotop showed PHP-FPM worker processes in the running state (R), but disk I/O throughput was less than 5MB/s. The high system CPU time indicated the kernel was spending an excessive amount of cycles executing system calls rather than waiting for physical block devices.

Kernel Tracing with ftrace

To isolate the specific kernel routines consuming the CPU cycles, I bypassed standard user-space profiling tools and utilized ftrace, the internal tracer built directly into the Linux kernel. Specifically, I deployed the function_graph tracer to map the execution flow and measure the duration of kernel functions associated with the PHP-FPM process.

I mounted the tracefs file system and isolated a single PHP-FPM worker process identifier (PID 41920) handling the background task.

mount -t tracefs nodev /sys/kernel/tracing
cd /sys/kernel/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
echo 41920 > set_ftrace_pid
echo 1 > tracing_on
sleep 2
echo 0 > tracing_on
cat trace > /tmp/ftrace_output.txt

Analyzing the resulting graph output revealed a heavily repetitive pattern.

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 2)               |  sys_newfstatat() {
 2)               |    vfs_fstatat() {
 2)               |      user_path_at_empty() {
 2)   0.812 us    |        filename_lookup();
 2)   1.210 us    |      }
 2)               |      vfs_getattr() {
 2)               |        nfs_getattr() {
 2)               |          nfs_revalidate_inode() {
 2)               |            __nfs_revalidate_inode() {
 2)               |              nfs_refresh_inode() {
 2)   0.210 us    |                spin_lock();
 2)   0.115 us    |                spin_unlock();
 2)   0.780 us    |              }
 2)               |              rpc_call_sync() {
 2)               |                rpc_execute() {
 2)               |                  rpc_do_put_task() {
 2)   0.105 us    |                    xprt_transmit();
 2) + 42.102 us   |                  }
 2) + 44.201 us   |                }
 2) + 45.110 us   |              }
 2) + 46.802 us   |            }
 2) + 47.100 us   |          }
 2) + 48.012 us   |        }
 2) + 49.102 us   |      }
 2) + 51.201 us   |    }
 2) + 52.110 us   |  }

The trace demonstrated that the sys_newfstatat system call was repeatedly invoking nfs_revalidate_inode, which subsequently triggered rpc_call_sync. This means the kernel was synchronously communicating with the NFS server to validate the metadata (attributes) of the files. The rpc_call_sync operation consumed approximately 45 microseconds per call.

When a PHP script iterates through 10,000 files, checking their status sequentially, 45 microseconds per file quickly accumulates. More importantly, the CPU core is stalled in kernel space during the RPC execution queue, driving up the %sy metric.

NFS Inode Revalidation Mechanics

To understand why the kernel forces a synchronous RPC call instead of relying on the local VFS dentry and inode caches, it is necessary to examine the NFS client implementation in the Linux kernel source tree.

The __nfs_revalidate_inode function resides in fs/nfs/inode.c.

/* fs/nfs/inode.c - Simplified for context */
int __nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
{
    int status = -ESTALE;
    struct nfs_inode *nfsi = NFS_I(inode);
    unsigned long timeo = nfs_attribute_timeout(inode);

    if (time_after(jiffies, nfsi->read_cache_jiffies + timeo)) {
        nfsi->cache_validity |= NFS_INO_INVALID_ATTR;
    }

    if (NFS_STALE(inode))
        goto out;

    if (nfsi->cache_validity & NFS_INO_INVALID_ATTR) {
        status = nfs_revalidate_inode_server(server, inode);
    } else {
        status = 0;
    }
out:
    return status;
}

The kernel maintains an internal timeout for NFS file attributes (nfs_attribute_timeout). If the current kernel time (jiffies) exceeds the time the attributes were cached (read_cache_jiffies) plus the timeout, the kernel flags the attributes as invalid (NFS_INO_INVALID_ATTR).

When this flag is set, any stat, lstat, or file_exists call from user space forces the kernel to execute nfs_revalidate_inode_server, which issues a GETATTR RPC call over the network to the NetApp appliance.

I verified the current NFS mount options to determine the configured timeout values.

cat /proc/mounts | grep nfs4
10.0.5.10:/it_attachments /var/www/html/wp-content/uploads nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.5.20,local_lock=none,addr=10.0.5.10 0 0

The mount string showed default attribute caching parameters. In NFS, the attribute cache timeout is dynamic. It is governed by acregmin (default 3 seconds) and acregmax (default 60 seconds) for regular files. When a file is accessed, its timeout starts at acregmin. If the file is not modified, the timeout gradually increases up to acregmax.

If the PHP background job takes longer than 3 seconds to iterate back to a previously checked directory, or if it checks files that have not been accessed recently, the kernel treats the cache as expired and issues the GETATTR RPC call.

Application Layer Diagnostics (Zend Engine)

The high frequency of these stat calls required investigation at the application layer. I attached strace to a worker process for exactly one second to quantify the system calls.

strace -c -p 41920 -e trace=newfstatat & sleep 1; kill $!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.048102           4     12025         0 newfstatat
------ ----------- ----------- --------- --------- ----------------
100.00    0.048102                 12025         0 total

12,025 newfstatat calls in a single second from a single worker process.

I examined the PHP source code within the Solutek theme handling the PDF generation. The process involves parsing an array of file paths, checking their existence, verifying permissions, and reading the file sizes to calculate the final document layout.

// wp-content/themes/solutek/inc/reports/class-pdf-compiler.php
protected function verify_attachment_integrity( $file_paths ) {
    $verified =[];
    foreach ( \(file_paths as \)path ) {
        if ( file_exists( $path ) ) {
            if ( is_readable( $path ) ) {
                \(size = filesize( \)path );
                if ( \(size > 0 && \)size < 52428800 ) {
                    $verified[] =[
                        'path' => $path,
                        'size' => $size
                    ];
                }
            }
        }
    }
    return $verified;
}

The logic appears straightforward, but it exposes a characteristic of the Zend Engine's virtual current working directory (CWD) and stat cache implementation.

In a typical local storage environment, standardizing media library management using a Free Download WooCommerce Theme involves occasional reads of product image metadata. The performance penalty of sequential stat calls is hidden by the local NVMe block layer and the Linux VFS dentry cache. However, over a network file system, the latency is amplified by orders of magnitude.

When PHP executes file_exists(), it invokes the C function php_stat within ext/standard/filestat.c.

/* ext/standard/filestat.c - PHP source snippet */
PHPAPI void php_stat(const char *filename, size_t filename_length, int type, zval *return_value)
{
    zend_stat_t stat_sb;
    php_stat_len local_len;
    int r;

    /* ... */
    r = php_sys_stat(filename, &stat_sb);
    if (r != 0) {
        RETURN_FALSE;
    }
    /* ... cache logic ... */
}

PHP does maintain its own user-land stat cache to mitigate repeated calls for the exact same file within the same request. However, file_exists(), is_readable(), and filesize() might hit the stat cache, but the underlying php_sys_stat (which translates to newfstatat) is still executed upon the first observation of the file path.

Because the verify_attachment_integrity function iterates through thousands of unique file paths, the PHP user-land stat cache is entirely ineffective. Every loop iteration is a cache miss in PHP, forcing the system call to the kernel, which in turn finds an expired acregmin timeout, forcing the RPC GETATTR over the network.

RPC XDR Packet Analysis

To definitively confirm the network overhead, I executed a packet capture filtering for the NFS protocol traffic.

tcpdump -i eth0 -s 0 -w nfs_trace.pcap port 2049

Analyzing the pcap using tshark revealed the structural depth of the XDR (External Data Representation) encoding used by RPC.

tshark -r nfs_trace.pcap -Y "nfs.opcode == 9" -T fields -e frame.time_delta -e nfs.fh.hash
0.000045000	0x4b2a1c90
0.000044102	0x8a1f4d21
0.000046011	0x1c9b2a44
0.000045201	0x9f1a2b3c

The delta time between the request and the reply was consistently 44-46 microseconds.

An NFSv4 GETATTR request is not a simple string query. The kernel must encode the file handle (a 128-byte identifier) and a bitmap of requested attributes into the XDR format.

Network File System, Ops(2): PUTFH, GETATTR
    [Program Version: 4][V4 Procedure: COMPOUND (1)]
    Tag: <EMPTY>
    minorversion: 1
    Operations (count: 2)
        Opcode: PUTFH (22)
            filehandle: 800000000000000000000000000000000000000000000000...
        Opcode: GETATTR (9)
            attr_request
                Bitmap[0]: 0x0010011a (FATTR4_TYPE, FATTR4_CHANGE, FATTR4_SIZE, FATTR4_FSID, FATTR4_FILEID)
                Bitmap[1]: 0x00a0a23a (FATTR4_MODE, FATTR4_NUMLINKS, FATTR4_OWNER, FATTR4_OWNER_GROUP, FATTR4_RAWDEV, FATTR4_SPACE_USED, FATTR4_TIME_ACCESS, FATTR4_TIME_METADATA, FATTR4_TIME_MODIFY)

The NetApp server receives this COMPOUND request, retrieves the inode data from its local proprietary file system (WAFL), encodes the integer and timestamp values back into XDR, and transmits the reply.

Processing 12,025 of these full round-trip RPC exchanges per second saturates the sunrpc task queues. The Linux kernel maintains a workqueue for RPC tasks. When a CPU core initiates a synchronous RPC call, it enters a wait state, polling or sleeping until the network stack processes the incoming XDR reply and wakes the specific rpc_task. The overhead of context switching and wait queue management under this volume is the direct source of the 45% %sy utilization.

Close-to-Open Cache Consistency

The fundamental protocol rule governing this behavior in NFS is Close-to-Open (CTO) cache consistency. CTO ensures that when an application opens a file, it will always see the data written by the last application that closed the file, regardless of which client node performed the write.

To enforce CTO, the NFS client must check if the file has been modified on the server at the moment it is opened (or stated). It does this by fetching the file's change attribute (or mtime) via GETATTR.

/* fs/nfs/dir.c - nfs_lookup implementation */
struct dentry *nfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
{
    struct dentry *res;
    struct nfs_fh *fhandle = NULL;
    struct nfs_fattr *fattr = NULL;
    /* ... */
    
    error = NFS_PROTO(dir)->lookup(dir, &dentry->d_name, fhandle, fattr, NULL);
    if (error == -ENOENT)
        goto no_entry;
    if (error < 0) {
        res = ERR_PTR(error);
        goto out;
    }
    /* ... */
}

The strictness of CTO is necessary for concurrent write environments, such as a database cluster writing to a shared logical unit. However, the IT attachment directory mapped to /var/www/html/wp-content/uploads is effectively a Write-Once-Read-Many (WORM) environment. Once a PDF or image is uploaded, it is never modified in place; it is only read during report generation.

Enforcing strict CTO for static attachments imposes network latency for zero structural benefit.

Adjusting the Mount Parameters

The resolution requires decoupling the VFS stat calls from the strict NFS attribute timeout mechanisms. We can instruct the kernel to trust its local dentry and inode caches for a longer, fixed duration, entirely bypassing the network validation for that window.

The actimeo parameter sets absolute attribute caching time in seconds, overriding the dynamic acregmin and acregmax scaling. The nocto parameter disables Close-to-Open consistency, meaning the client will not force a GETATTR upon file open if the local attribute cache is still valid according to the actimeo limit.

I unmounted and remounted the volume with the adjusted parameters.

umount /var/www/html/wp-content/uploads
mount -t nfs4 -o rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,actimeo=3600,nocto,sec=sys,clientaddr=10.0.5.20,local_lock=none 10.0.5.10:/it_attachments /var/www/html/wp-content/uploads

Setting actimeo=3600 instructs the kernel to cache file attributes for exactly one hour. For the duration of the hourly cron job, checking a file's existence will resolve entirely within the local RAM of the worker node.

I immediately re-attached strace to observe the system call latency.

strace -c -p \((pgrep -f "php-fpm" | head -n 1) -e trace=newfstatat & sleep 1; kill \)!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000102           0     12501         0 newfstatat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000102                 12501         0 total

The time spent executing the system calls plummeted. The usecs/call dropped to near zero (sub-microsecond), as the kernel was now instantly returning the cached inode metadata directly from memory.

I verified the system CPU utilization via top. The %sy metric stabilized at 2.1%. The PDF generation task, which previously required 14 minutes to iterate through the attachments, completed in 42 seconds.

Memory Constraints of the Inode Cache

While expanding the attribute timeout eliminates network latency, it fundamentally shifts the burden to local memory. The kernel must hold the inode structures in RAM for the specified duration.

I examined the memory footprint of the NFS inodes.

cat /proc/slabinfo | grep nfs_inode_cache
nfs_inode_cache    84201  85012   1080   29    8 : tunables    0    0    0 : slabdata   2931   2931      0

The object size for an nfs_inode_cache entry is 1,080 bytes. With 84,201 active objects in the cache, the total memory consumed is approximately 90 Megabytes. Given the 16GB available on the worker nodes, this memory allocation is entirely acceptable for the performance gain.

If the attachment repository contained tens of millions of files, forcing an indefinite cache via actimeo could introduce memory pressure, requiring explicit configuration of vfs_cache_pressure to ensure the kernel reclaims the dentry structures periodically to prevent application OOM conditions.

However, in this specific topology, the fixed 3600-second eviction window provides the necessary equilibrium.

Resolution Configuration

Apply the following modifications to /etc/fstab to ensure the tuned parameters persist across reboots for read-heavy, write-once shared storage partitions.

10.0.5.10:/it_attachments /var/www/html/wp-content/uploads nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,hard,proto=tcp,actimeo=3600,nocto,sec=sys,local_lock=none 0 0