Measuring Task I/O with eBPF: Microsecond Precision

Popular task monitoring tools (top, iotop, anything built on procfs) report disk I/O by polling kernel counters at their refresh interval, typically once per second. That rate hides short-lived I/O bursts and makes it hard to tell which tasks are truly I/O bound. eBPF programs attached to kernel tracepoints can timestamp every block request with microsecond-level accuracy and low overhead.

Why Sampling Isn’t Enough

Standard Linux task accounting exposes cumulative I/O counters, and a monitoring tool only sees the difference between two consecutive reads. With a typical refresh interval of one second, everything that happened within that second collapses into a single number. Tools reading /proc/[pid]/io or using the taskstats interface inherit this limitation.

Consider a workload where a task performs many small I/O operations in quick succession, such as a database table scan or a log aggregation pipeline. A 1 Hz poll shows the total volume but not when each operation ran or how long the task waited for it. You need to measure the actual time tasks spend blocked waiting for disk I/O completion.
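
To make the limitation concrete, here is a minimal sketch of what a polling tool has to work with. It parses the "key: value" lines of /proc/[pid]/io and subtracts two snapshots. The snapshot contents below are made-up values, and the helper names parse_proc_io and counter_delta are hypothetical.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def counter_delta(before, after):
    """Per-counter difference between two snapshots."""
    return {key: after[key] - before[key] for key in after if key in before}


# Two hypothetical snapshots taken one second apart (values are made up).
SNAPSHOT_T0 = "rchar: 1000\nwchar: 2000\nread_bytes: 4096\nwrite_bytes: 8192\n"
SNAPSHOT_T1 = "rchar: 1500\nwchar: 2000\nread_bytes: 1052672\nwrite_bytes: 8192\n"

delta = counter_delta(parse_proc_io(SNAPSHOT_T0), parse_proc_io(SNAPSHOT_T1))
# The delta says how many bytes moved, not when or how long the task waited.
print(delta["read_bytes"])
```

The result is a byte count for the whole interval. Whether those bytes arrived in one burst or were spread evenly, and how long the task was blocked, cannot be recovered from it.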

Defining I/O Utilization Accurately

Task I/O utilization is the accumulated time a task spends waiting for disk I/O operations to complete, divided by wall-clock time.

For example, if a task executes dd if=/dev/zero of=./testfile bs=1M count=1000 and spends 5 seconds blocked waiting for disk writes to finish during a 10-second window, its I/O utilization is 50%.
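
The definition can be sketched in a few lines. This version assumes a task counts as waiting whenever at least one of its requests is outstanding, so overlapping requests are merged instead of being counted twice. That merging rule and the function names are assumptions made for illustration.

```python
def blocked_ns(intervals):
    """Total time covered by at least one (start_ns, end_ns) interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this request: close the previous busy period.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current busy period: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def io_utilization(intervals, window_ns):
    """Fraction of the window during which some request was outstanding."""
    return blocked_ns(intervals) / window_ns


SECOND = 1_000_000_000
# The dd example: 5 seconds of waiting inside a 10-second window.
print(io_utilization([(2 * SECOND, 7 * SECOND)], 10 * SECOND))
```

Simply summing per-request times is cheaper and is what a kernel-side accumulator naturally does, but with several requests in flight at once the sum can exceed the window length.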

The challenge is capturing this timing at sufficient granularity without heavy kernel overhead.

Using eBPF for Microsecond-Level Tracing

eBPF has supported kernel tracing since Linux 4.1 (kprobes), and BPF programs can attach to tracepoints since 4.7. Modern eBPF programs can hook into block device I/O tracepoints with nanosecond timestamps, track individual I/O requests through their lifecycle, and accumulate timing per task without frequent context switches to userspace.

The approach:

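Record the start time and the submitting PID when a task issues an I/O request to the block layer
Record the completion time when the I/O completes
Calculate the delta (time spent in flight)
Accumulate deltas under the PID saved at submission, because completions usually run in interrupt context where the current task is unrelated
Report I/O utilization as (total_accumulated_time / observation_window)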
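
Before moving into the kernel, the steps above can be modeled in plain Python. This is only a sketch of the bookkeeping: the event tuples (kind, request_id, pid, timestamp_ns) are a hypothetical format, not something the kernel emits in this shape.

```python
from collections import defaultdict


def accumulate(events):
    """Pair issue/complete events and sum in-flight time per PID.

    Each event is a hypothetical tuple: (kind, request_id, pid, timestamp_ns).
    The PID on a "complete" event is ignored; the submitter's PID is used.
    """
    inflight = {}                     # request_id -> (pid, start_ns)
    total_ns = defaultdict(int)       # pid -> accumulated in-flight time
    for kind, request_id, pid, ts in events:
        if kind == "issue":
            inflight[request_id] = (pid, ts)
        elif kind == "complete":
            started = inflight.pop(request_id, None)
            if started is None:
                continue              # issued before tracing began
            owner, start_ns = started
            total_ns[owner] += ts - start_ns
    return dict(total_ns)


def utilization(total_ns, window_ns):
    """Accumulated in-flight time divided by the observation window."""
    return {pid: ns / window_ns for pid, ns in total_ns.items()}


events = [
    ("issue", "a", 100, 0),
    ("issue", "b", 200, 1_000),
    ("complete", "b", 0, 2_000),      # completes in some unrelated context
    ("complete", "a", 0, 4_000),
    ("complete", "c", 0, 5_000),      # issue was never seen; skipped
]
totals = accumulate(events)
print(utilization(totals, 10_000))
```

The in-flight dictionary plays the role of a BPF hash map keyed by request, and the per-PID totals correspond to a second map that userspace reads periodically.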

Implementation with eBPF

Here’s a practical eBPF program that hooks block I/O tracepoints and tracks latency per task:

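// biolatency.c - measure block I/O latency per task (BCC syntax)
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

// The block tracepoints expose dev and sector rather than a request
// pointer, so a request is identified by (device, start sector).
// Both fields are u64 to keep the hash key free of padding.
typedef struct {
    u64 dev;
    u64 sector;
} rq_id_t;

// Saved at issue time: when the request started and who submitted it.
typedef struct {
    u64 ts;
    u32 pid;
    char name[TASK_COMM_LEN];
} rq_start_t;

typedef struct {
    u32 pid;
    char name[TASK_COMM_LEN];
} req_key_t;

typedef struct {
    u64 total_ns;     // total time requests spent in flight
    u64 io_count;     // number of completed requests
    u64 bytes;        // total bytes transferred
} req_val_t;

BPF_HASH(inflight, rq_id_t, rq_start_t);
BPF_HASH(stats, req_key_t, req_val_t);

TRACEPOINT_PROBE(block, block_rq_issue) {
    rq_id_t id = {};
    id.dev = args->dev;
    id.sector = args->sector;

    // Capture the submitter here. Buffered writes are often issued by
    // kernel writeback threads and will show up under their PIDs.
    rq_start_t start = {};
    start.ts = bpf_ktime_get_ns();
    start.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&start.name, sizeof(start.name));

    inflight.update(&id, &start);
    return 0;
}

TRACEPOINT_PROBE(block, block_rq_complete) {
    rq_id_t id = {};
    id.dev = args->dev;
    id.sector = args->sector;

    rq_start_t *start = inflight.lookup(&id);
    if (!start)
        return 0;     // request was issued before tracing began

    u64 delta_ns = bpf_ktime_get_ns() - start->ts;

    // Completion usually runs in interrupt context, so the current task
    // is unrelated. Attribute the time to the PID saved at issue.
    req_key_t key = {};
    key.pid = start->pid;
    __builtin_memcpy(&key.name, start->name, sizeof(key.name));
    inflight.delete(&id);

    req_val_t zero = {};
    req_val_t *val = stats.lookup_or_try_init(&key, &zero);
    if (val) {
        __sync_fetch_and_add(&val->total_ns, delta_ns);
        __sync_fetch_and_add(&val->io_count, 1);
        // nr_sector counts 512-byte sectors
        __sync_fetch_and_add(&val->bytes, (u64)args->nr_sector << 9);
    }

    return 0;
}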

The Python userspace program:

#!/usr/bin/env python3
# biolatency.py - print per-task block I/O statistics once per interval
import sys
import time
from bcc import BPF

INTERVAL_S = 1

# Load the kernel-side program shown above, saved next to this script.
b = BPF(src_file="biolatency.c")
stats_table = b["stats"]

try:
    while True:
        time.sleep(INTERVAL_S)
        print("\n%-6s %-16s %10s %8s %16s" %
              ("PID", "COMM", "IO/sec", "MB/sec", "Avg Latency(us)"))

        # Tasks with the most accumulated in-flight time come first.
        for k, v in sorted(stats_table.items(),
                           key=lambda item: item[1].total_ns, reverse=True):
            io_per_sec = v.io_count / INTERVAL_S
            mb_per_sec = v.bytes / (1024 * 1024) / INTERVAL_S
            avg_latency_us = v.total_ns / v.io_count / 1000 if v.io_count else 0

            print("%-6d %-16s %10d %8.2f %16.2f" %
                  (k.pid, k.name.decode("utf-8", "replace"),
                   io_per_sec, mb_per_sec, avg_latency_us))

        # Reset so the next report covers only the next interval.
        stats_table.clear()

except KeyboardInterrupt:
    sys.exit(0)

Run it with:

sudo python3 biolatency.py

This will output a real-time view of I/O operations per task, updated every second.
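
The reporting loop treats every interval as exactly one second long, but sleep calls drift and the time spent printing is not counted. A small helper that divides by the measured interval keeps the rates honest. This is a sketch with a hypothetical function name; elapsed_s would come from the difference of two time.monotonic() readings taken around each report.

```python
def per_second(io_count, byte_count, total_ns, elapsed_s):
    """Convert one interval's raw counters into rates and an average latency."""
    iops = io_count / elapsed_s
    mb_per_s = byte_count / (1024 * 1024) / elapsed_s
    avg_latency_us = total_ns / io_count / 1000 if io_count else 0.0
    # Share of the interval covered by summed in-flight time. It can exceed
    # 1.0 when several requests overlap.
    inflight_share = total_ns / (elapsed_s * 1e9)
    return iops, mb_per_s, avg_latency_us, inflight_share


# Made-up counters for a report that actually took 2.0 seconds.
print(per_second(io_count=200, byte_count=8 * 1024 * 1024,
                 total_ns=500_000_000, elapsed_s=2.0))
```

The last value is the closest thing to the I/O utilization defined earlier that the collected counters can provide without tracking overlap.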

Advantages Over Traditional Monitoring

Microsecond resolution: Captures every I/O request, not periodic counter snapshots
Per-task accuracy: Distinguishes between truly I/O-bound tasks and false positives
Low overhead: eBPF executes in kernel with minimal context switching
No kernel rebuilds: Works with standard kernels that support attaching eBPF to tracepoints (4.7+)

Kernel Requirements and Compatibility

Attaching BPF programs to tracepoints requires Linux 4.7 or later. The bpf_get_current_task_btf() helper is available from Linux 5.11; on older kernels, use bpf_get_current_task() instead and adjust task_struct field accesses as needed. The nanosecond timing accuracy issues present in earlier kernels were fixed in stable kernels 5.1+, so modern deployments won’t encounter them.
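
A startup check can fail early with a clear message instead of a BPF load error. The sketch below compares a kernel release string against a minimum you choose. The helper names are hypothetical, and the (4, 7) value in the example is an assumption about when tracepoint attachment became available, so substitute whatever minimum your program needs.

```python
import platform
import re


def parse_release(release):
    """Extract (major, minor) from a release string like '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(minimum, release=None):
    """True if the given (or running) kernel is at least `minimum`."""
    if release is None:
        release = platform.release()   # the running kernel on Linux
    return parse_release(release) >= minimum


# Assumed minimum for tracepoint programs; adjust for your own helpers.
MIN_KERNEL = (4, 7)
print(kernel_at_least(MIN_KERNEL, "5.15.0-91-generic"))
print(kernel_at_least(MIN_KERNEL, "4.4.0-210-generic"))
```

Distribution kernels often backport BPF features, so treat a version check as a hint rather than proof.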

Production Considerations

For production deployments, consider these enhancements:

Device filtering: Hook block_rq_issue to filter by major/minor device numbers if tracking specific storage only. This reduces noise from unrelated I/O.
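
To build such a filter you need the numeric value that the tracepoint's dev field carries. The sketch below assumes the kernel-internal dev_t layout, with the major number in the upper bits and a 20-bit minor number. The helper names are hypothetical.

```python
MINORBITS = 20                        # width of the minor field in kernel dev_t
MINORMASK = (1 << MINORBITS) - 1


def make_dev(major, minor):
    """Combine major and minor numbers into a kernel-style dev value."""
    return (major << MINORBITS) | minor


def split_dev(dev):
    """Split a kernel-style dev value into (major, minor)."""
    return dev >> MINORBITS, dev & MINORMASK


# Major 8, minor 16 is commonly the second SCSI disk (an illustrative pick).
target = make_dev(8, 16)
print(target, split_dev(target))
```

The resulting number can be substituted into the BPF source as a constant and compared against args->dev at the top of the probe, returning early for every other device.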

Percentile latency: Track p50, p95, p99 instead of just averages using BPF histograms:

BPF_HISTOGRAM(io_latency, u64);
// In block_rq_complete:
io_latency.increment(bpf_log2l(delta_ns));
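
A log2 histogram stores only bucket counts, so percentiles read from it are estimates bounded by bucket width. The sketch below returns the upper edge of the bucket that contains a given percentile. It assumes slot i covers values from 2^(i-1) to 2^i - 1, which matches how BCC labels log2 buckets as far as I know; the function name and sample counts are made up.

```python
def percentile_upper_bound(slots, pct):
    """Upper bound of the log2 bucket that contains the given percentile.

    `slots` maps a slot index to its count. Slot i is assumed to cover
    values from 2**(i - 1) to 2**i - 1.
    """
    total = sum(slots.values())
    if total == 0:
        return None
    threshold = total * pct / 100.0
    seen = 0
    for index in sorted(slots):
        seen += slots[index]
        if seen >= threshold:
            return (1 << index) - 1
    return None


# Hypothetical counts. Slot 17 spans roughly 65 to 131 microseconds in ns.
hist = {17: 90, 20: 9, 24: 1}
for pct in (50, 95, 99, 100):
    print(pct, percentile_upper_bound(hist, pct))
```

Because each bucket spans a factor of two, the estimate can be off by up to 2x. If you need tighter percentiles, a linear histogram over the latency range you care about is an alternative.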

Per-cgroup tracking: Group I/O by cgroup v2 for container and Kubernetes environments:

typedef struct {
    u64 cgroup_id;
    char name[TASK_COMM_LEN];
} cgroup_key_t;

BPF_HASH(cgroup_stats, cgroup_key_t, req_val_t);
// Call bpf_get_current_cgroup_id() in block_rq_issue, where the
// submitting task is still current, and use the result as the key
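
On the userspace side, a numeric cgroup ID is not very readable. The sketch below builds a lookup table from directory inode numbers to paths. It assumes the ID reported for cgroup v2 equals the inode number of the cgroup's directory under the cgroup filesystem mount, which you should verify on your kernel. The function name and the directory names in the demonstration are hypothetical.

```python
import os
import tempfile


def cgroup_paths_by_id(root):
    """Map inode numbers of all directories under `root` to their paths."""
    mapping = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        mapping[os.stat(dirpath).st_ino] = dirpath
    return mapping


# Demonstration on a throwaway directory tree instead of /sys/fs/cgroup.
with tempfile.TemporaryDirectory() as root:
    pod_dir = os.path.join(root, "kubepods.slice", "pod-example")
    os.makedirs(pod_dir)
    lookup = cgroup_paths_by_id(root)
    found = lookup[os.stat(pod_dir).st_ino]
    print(found == pod_dir)
```

In real use you would pass the cgroup v2 mount point (commonly /sys/fs/cgroup) as root and rebuild the table periodically, since containers come and go.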

Long-running aggregation: Keep statistics across multiple observation windows for trend analysis rather than clearing each interval. Track peak I/O rates and identify slow periods.
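
One way to do this is to keep the per-interval reset in the BPF map and fold each interval into a longer-lived structure in userspace. The class below is a minimal sketch with hypothetical names; it keeps lifetime totals and the busiest interval seen per task.

```python
class TrendTracker:
    """Keep lifetime totals and the busiest interval per task."""

    def __init__(self):
        self.total_ios = {}           # pid -> I/Os since tracking began
        self.peak_iops = {}           # pid -> highest per-interval rate seen

    def record(self, pid, io_count, elapsed_s):
        """Fold one interval's counters for one task into the totals."""
        iops = io_count / elapsed_s
        self.total_ios[pid] = self.total_ios.get(pid, 0) + io_count
        self.peak_iops[pid] = max(self.peak_iops.get(pid, 0.0), iops)


tracker = TrendTracker()
# Three made-up one-second intervals for a made-up PID.
for count in (120, 900, 40):
    tracker.record(pid=4242, io_count=count, elapsed_s=1.0)
print(tracker.total_ios[4242], tracker.peak_iops[4242])
```

Entries for exited tasks accumulate forever in this form, so a real deployment would also expire PIDs that have been idle for a while.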

Storage integration: Filter by device type (NVMe, HDD, network block device) to separate local and remote I/O characteristics.
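
A rough classification can be made in userspace from the device name and the rotational flag that sysfs exposes at /sys/block/<name>/queue/rotational. The sketch below is a heuristic only: the name prefixes (nvme, nbd) follow common Linux naming, and the function names and category labels are hypothetical.

```python
import os


def classify_device(name, rotational):
    """Guess a device class from its name and sysfs rotational flag."""
    if name.startswith("nvme"):
        return "nvme"
    if name.startswith("nbd"):
        return "network"
    return "hdd" if rotational.strip() == "1" else "ssd"


def read_rotational(name, sysfs_root="/sys/block"):
    """Read the rotational flag for a block device; '0' if unreadable."""
    path = os.path.join(sysfs_root, name, "queue", "rotational")
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return "0"


for name, flag in (("nvme0n1", "0"), ("sda", "1"), ("sdb", "0"), ("nbd0", "0")):
    print(name, classify_device(name, flag))
```

Virtual disks in cloud environments often report misleading rotational flags, so treat the result as a starting point rather than ground truth.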

These enhancements help surface performance problems that would be invisible with traditional sampling approaches.
