Reducing QEMU/KVM Lock Contention with IOThreads and Dataplane
QEMU/KVM uses global locks to synchronize its threads, which is necessary for correctness but becomes a performance problem at scale. When I/O worker threads and vCPU threads compete for the same lock, the resulting contention shows up as scheduling jitter, reduced scalability, and degraded I/O performance.
IOThreads (the modern implementation of dataplane) address this by moving I/O request handling to dedicated threads, isolating them from the QEMU main loop. This removes contention between I/O processing and vCPU scheduling.
Identifying Lock Contention Problems
Lock contention typically manifests as:
- Unstable vCPU timeslices (scheduling jitter)
- High context switch rates on host CPUs
- Poor I/O throughput despite low queue depths
- Uneven load distribution across physical CPUs
Detecting Contention with perf
Kernel-level tracing reveals lock contention clearly:
perf record -e sched:sched_switch -e syscalls:sys_enter_futex -g -- <workload>
perf report
Look for excessive futex operations or context switches on I/O path CPUs. The call graphs in the perf output show which code paths spend their time waiting in futex calls, which in turn points at the most heavily contended locks.
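To tally the raw samples, `perf script` prints one line per event; the short helper below counts events per thread. The line layout assumed here (comm, tid, [cpu], timestamp:, event: ...) is perf's common default but varies with options, so treat it as a starting sketch:

```python
# Tally perf events per (comm, tid, event) from `perf script` text output.
# Assumes perf's default line layout: comm tid [cpu] timestamp: event: args...
from collections import Counter

def event_counts(perf_script_output: str) -> Counter:
    counts = Counter()
    for line in perf_script_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        comm, tid = fields[0], fields[1]
        # The event name is the first colon-terminated token that is not
        # the timestamp (timestamps start with a digit).
        event = next((f.rstrip(":") for f in fields[2:]
                      if f.endswith(":") and not f[0].isdigit()), None)
        if event:
            counts[(comm, tid, event)] += 1
    return counts

sample = """\
qemu-system-x86 1234 [002] 100.001: syscalls:sys_enter_futex: ...
qemu-system-x86 1234 [002] 100.002: syscalls:sys_enter_futex: ...
IO-io1 1235 [003] 100.003: sched:sched_switch: ...
"""
print(event_counts(sample).most_common(1))  # the futex-heavy thread tops the list
```

A thread that dominates the futex-event count while its vCPU siblings rack up sched_switch events is the usual signature of lock contention.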
Checking Context Switches with pidstat
Watch for high context switch counts on QEMU threads:
pidstat -w -p $(pidof qemu-system-x86_64) 1
A high cswch/s (voluntary context switch) rate means threads are frequently blocking and yielding the CPU, which is exactly what waiting on a contended lock does; preemption shows up in nvcswch/s instead. Thousands of voluntary switches per second on QEMU threads is a strong signal.
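To avoid eyeballing the columns, a small parser can flag the noisy threads automatically. This is a sketch: the column layout varies between sysstat versions, so it keys off the header row rather than fixed positions:

```python
# Flag threads with a high voluntary context-switch rate from
# `pidstat -w` output. Column positions are taken from the header row
# because they differ between sysstat versions.

def high_cswch_threads(pidstat_output: str, threshold: float = 1000.0):
    """Return (command, cswch/s) pairs whose rate exceeds the threshold."""
    flagged = []
    header_cols = None
    for line in pidstat_output.splitlines():
        cols = line.split()
        if "cswch/s" in cols:            # header row: remember column positions
            header_cols = cols
            continue
        if header_cols is None or len(cols) != len(header_cols):
            continue
        try:
            rate = float(cols[header_cols.index("cswch/s")])
        except ValueError:
            continue
        if rate > threshold:
            flagged.append((cols[header_cols.index("Command")], rate))
    return flagged

sample = """\
08:00:01      UID       PID   cswch/s nvcswch/s  Command
08:00:02        0      1234   5230.00    812.00  qemu-system-x86
08:00:02        0      1235     12.00      3.00  qemu-system-x86
"""
print(high_cswch_threads(sample))  # [('qemu-system-x86', 5230.0)]
```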
Enabling QEMU Trace Points
Enable QEMU’s built-in trace points to see lock acquisition patterns:
qemu-system-x86_64 -trace enable=qemu_mutex_lock \
-trace enable=qemu_mutex_unlock \
-trace file=/tmp/qemu.trace
Parse the trace file to identify which locks are acquired most frequently and for how long.
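One way to do that parsing, sketched against a simplified, hypothetical "timestamp event mutex-address" record format (real QEMU trace output differs by trace backend, so adapt the line splitting):

```python
# Pair lock/unlock events from a simplified, hypothetical text trace to
# estimate how long each mutex is held. Adapt the parsing to the actual
# record format your QEMU trace backend emits.
from collections import defaultdict

def mutex_hold_times(lines):
    """lines: iterable of 'timestamp event mutex_addr' records.
    Returns {mutex_addr: total_time_held} summed over lock/unlock pairs."""
    locked_at = {}
    held = defaultdict(float)
    for line in lines:
        ts, event, addr = line.split()
        if event == "qemu_mutex_locked":
            locked_at[addr] = float(ts)
        elif event == "qemu_mutex_unlock" and addr in locked_at:
            held[addr] += float(ts) - locked_at.pop(addr)
    return dict(held)

trace = [
    "100.0 qemu_mutex_locked 0xaaaa",
    "100.5 qemu_mutex_unlock 0xaaaa",
    "101.0 qemu_mutex_locked 0xaaaa",
    "101.2 qemu_mutex_unlock 0xaaaa",
]
print(mutex_hold_times(trace))  # 0xaaaa held for ~0.7 in total
```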
Configuring IOThreads with libvirt (Recommended)
QEMU has supported IOThreads natively (via -object iothread) since the 2.x series, and this is the only approach you should use in new deployments. Define IOThreads in your domain XML:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>guest-vm</name>
<memory unit='GiB'>16</memory>
<vcpu placement='static'>8</vcpu>
<iothreads>2</iothreads>
<iothreadids>
<iothread id='1'/>
<iothread id='2'/>
</iothreadids>
<cputune>
<!-- Pin IOThreads to isolated CPUs -->
<iothreadpin iothread='1' cpuset='8'/>
<iothreadpin iothread='2' cpuset='9'/>
<!-- Pin vCPUs to separate cores -->
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<!-- Optional: Set CPU affinity for emulator thread -->
<emulatorpin cpuset='10'/>
</cputune>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native' iothread='1'/>
<source file='/var/lib/libvirt/images/guest-vm.qcow2'/>
<target dev='vda' bus='virtio'/>
</disk>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native' iothread='2'/>
<source file='/var/lib/libvirt/images/guest-vm-data.qcow2'/>
<target dev='vdb' bus='virtio'/>
</disk>
<interface type='network'>
<model type='virtio'/>
<!-- virtio-net has no iothread property; use vhost to move packet processing into the kernel instead -->
<driver name='vhost'/>
<source network='default'/>
</interface>
</devices>
</domain>
Key configuration points:
- <iothreads>: Number of dedicated I/O threads to create
- <iothreadpin>: Pins each IOThread to specific CPUs, typically isolated cores separate from the vCPUs
- iothread='1' in a disk's <driver> element: associates the device with a specific IOThread
- <emulatorpin>: Optional; pins the emulator thread to a separate core to further reduce contention
Direct QEMU Command Line (No libvirt)
If running QEMU directly without libvirt (note that iothread is a property of the virtio-blk device, not of -drive):
qemu-system-x86_64 \
-object iothread,id=io1 \
-object iothread,id=io2 \
-drive file=disk1.qcow2,format=qcow2,cache=none,aio=native,if=none,id=drive1 \
-device virtio-blk-pci,drive=drive1,iothread=io1 \
-drive file=disk2.qcow2,format=qcow2,cache=none,aio=native,if=none,id=drive2 \
-device virtio-blk-pci,drive=drive2,iothread=io2 \
-netdev user,id=net0 \
-device virtio-net-pci,netdev=net0 \
...
Then pin threads from the host:
# Find IOThread TIDs (thread names contain the iothread object id)
ps -T -p $(pidof qemu-system-x86_64)
# Pin to CPUs
taskset -cp 8 <iothread-tid-1>
taskset -cp 9 <iothread-tid-2>
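Rather than grepping ps output by hand, the IOThread TIDs can be read from /proc/<pid>/task/*/comm, since QEMU names each IOThread after its object id (the exact comm string is an assumption; verify on your host). A sketch, with the /proc root parameterized so it can be tested against a fake tree:

```python
# Locate IOThread TIDs by scanning /proc/<pid>/task/*/comm and print the
# taskset commands to run. The thread-name match is an assumption; check
# the actual comm strings on your host first.
import os

def find_iothread_tids(pid, name_substr="io", proc_root="/proc"):
    """Return {tid: comm} for threads whose name contains name_substr."""
    tids = {}
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in sorted(os.listdir(task_dir)):
        with open(os.path.join(task_dir, tid, "comm")) as f:
            comm = f.read().strip()
        if name_substr in comm:
            tids[int(tid)] = comm
    return tids

# Usage sketch: print the pinning commands instead of running them blindly.
# for tid, cpu in zip(find_iothread_tids(qemu_pid), (8, 9)):
#     print(f"taskset -cp {cpu} {tid}")
```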
Working with Older QEMU (Legacy x-data-plane)
If you are stuck on a QEMU release that predates stable native IOThread support, you can fall back on the legacy, experimental x-data-plane parameter:
<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
</qemu:commandline>
The x-data-plane parameter was deprecated once native IOThread support landed and has been removed entirely in later versions. Plan your upgrade path accordingly: current distributions ship far newer QEMU releases, so this approach is obsolete outside of legacy systems.
CPU Isolation and Pinning Strategy
Pin IOThreads to CPUs separate from vCPU threads to minimize cache misses and context switching:
# View current pinning
virsh vcpupin guest-vm
virsh iothreadinfo guest-vm
# Manually adjust if needed
virsh iothreadpin guest-vm 1 8 --live
virsh vcpupin guest-vm 0 0 --live
Check your host CPU topology first:
lscpu
numactl --hardware
Ideally, pin IOThreads to CPUs on the same NUMA node as storage controllers, and vCPUs to separate nodes if the hardware supports it. For systems with SMT (hyperthreading), avoid pairing IOThreads and vCPUs on the same physical core.
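The SMT rule is mechanical enough to check in code. The sketch below takes sibling sets (as read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list) and reports any IOThread/vCPU pair that lands on one physical core:

```python
# Verify that no IOThread CPU shares a physical core (SMT sibling pair)
# with a vCPU CPU. Sibling sets are passed in directly; on a real host,
# build them from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

def smt_conflicts(iothread_cpus, vcpu_cpus, sibling_sets):
    """sibling_sets: iterable of sets of CPUs sharing one physical core.
    Returns (iothread_cpu, vcpu_cpu) pairs that share a core."""
    conflicts = []
    for core in map(set, sibling_sets):
        for io_cpu in set(iothread_cpus) & core:
            for v_cpu in set(vcpu_cpus) & core:
                conflicts.append((io_cpu, v_cpu))
    return conflicts

# vCPUs on 0-3, IOThreads on 8-9; CPU 8 is CPU 0's SMT sibling, 9 is 1's.
pairs = [{0, 8}, {1, 9}, {2, 10}, {3, 11}]
print(smt_conflicts({8, 9}, {0, 1, 2, 3}, pairs))  # [(8, 0), (9, 1)]
```

An empty result means the pinning plan keeps IOThreads and vCPUs on distinct physical cores.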
I/O Scheduler Configuration
Inside the guest, use none or mq-deadline scheduler for predictable latency:
# Check current scheduler
cat /sys/block/vda/queue/scheduler
# Set to mq-deadline
echo mq-deadline > /sys/block/vda/queue/scheduler
For the host, none works well for high-concurrency workloads. bfq provides fair I/O scheduling if you have many competing VMs.
Cache Mode Selection
The cache parameter significantly affects performance and data safety:
- cache='none': Bypass the host page cache entirely (O_DIRECT). The usual choice for databases and workloads requiring synchronous I/O; safe for data integrity because guest flushes reach the device.
- cache='directsync': Direct I/O with a flush on every write. The safest option, but slower.
- cache='writeback': Use the host page cache with delayed flushing. Fast, but risks data loss on a host crash.
- cache='writethrough': Write-through caching. A good balance for many workloads.
For production databases, always use cache='none' with io='native'. For read-heavy workloads, cache='writeback' improves throughput significantly.
Monitoring IOThread Activity
Verify that IOThreads are active:
# Check if IOThreads were created
virsh qemu-monitor-command guest-vm '{"execute":"query-iothreads"}'
Output example:
{
  "return": [
    {
      "id": "io1",
      "thread-id": 12345
    },
    {
      "id": "io2",
      "thread-id": 12346
    }
  ]
}
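The reply is plain JSON following the QMP IOThreadInfo schema (id, thread-id), so mapping IOThread ids to host thread ids for pinning or monitoring is a one-liner:

```python
# Extract IOThread id -> host thread-id from a query-iothreads QMP reply.
import json

def iothread_map(qmp_reply: str) -> dict:
    return {t["id"]: t["thread-id"] for t in json.loads(qmp_reply)["return"]}

reply = '{"return": [{"id": "io1", "thread-id": 12345}, {"id": "io2", "thread-id": 12346}]}'
print(iothread_map(reply))  # {'io1': 12345, 'io2': 12346}
```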
Monitor I/O performance with iostat:
iostat -x 1 /dev/vda
Watch for:
- await: Average wait time per request (lower is better)
- svctm: Service time, which should decrease with IOThreads (note: this field is deprecated in recent sysstat releases)
- %util: Device utilization percentage; with IOThreads the load should spread across multiple host threads
Compare metrics before and after enabling IOThreads; you should see reduced await and more balanced load.
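A tiny helper makes the before/after comparison explicit. The numbers below are placeholders for values read off iostat, not measurements:

```python
# Quantify the before/after change for a few iostat metrics.
# These values are illustrative placeholders, not real measurements.

def pct_change(before: float, after: float) -> float:
    """Negative result = the metric went down (good for await)."""
    return (after - before) / before * 100.0

before = {"await": 4.8, "util": 91.0}
after = {"await": 3.1, "util": 64.0}
for metric in before:
    print(f"{metric}: {pct_change(before[metric], after[metric]):+.1f}%")
# await: -35.4%
# util: -29.7%
```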
For more detailed device monitoring, use virsh domblklist and virsh domstats to track performance across multiple block devices.
When IOThreads Provide Real Benefits
IOThreads help most when:
- Multiple vCPUs generate sustained, concurrent I/O load
- I/O and CPU work overlap (not sequential bottlenecks)
- The workload is I/O-bound with queue depths > 2
- You have spare CPU cores dedicated to IOThreads (not oversubscribed)
- Storage latency is high enough that thread scheduling matters (network storage, large systems)
IOThreads add overhead and provide no benefit for:
- Light I/O workloads (< 1000 IOPS)
- CPU-bound VMs
- Single-threaded guest applications
- Heavily oversubscribed hosts with no spare cores
Measure before and after. If iostat and guest I/O benchmarks don’t improve, disable IOThreads and reclaim the CPU cores. Some workloads (particularly those with very low queue depth or light I/O) may actually regress slightly due to context switching overhead introduced by thread scheduling.
Practical Benchmarking
Use fio to test the impact of IOThreads:
# Inside guest: sequential read test
fio --name=seqread --ioengine=libaio --direct=1 --rw=read \
--bs=4k --iodepth=32 --numjobs=4 --runtime=60 --group_reporting
# Random read/write test
fio --name=randrw --ioengine=libaio --direct=1 --rw=randrw --rwmixread=70 \
--bs=4k --iodepth=32 --numjobs=4 --runtime=60 --group_reporting
Run these tests with and without IOThreads enabled, recording latency percentiles and throughput. A well-tuned IOThread setup typically shows 15-30% improvement in latency p99 and throughput for concurrent I/O workloads.
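To compare runs mechanically rather than by reading fio's text summary, run fio with --output-format=json and pull out the p99 completion latency. The clat_ns/percentile layout matches recent fio versions (older releases report clat in microseconds), so treat this as a sketch:

```python
# Pull the p99 completion latency (in microseconds) out of fio's JSON
# output. Field layout assumes recent fio (clat_ns with a percentile map).
import json

def p99_latency_us(fio_json: str, job: int = 0, direction: str = "read") -> float:
    data = json.loads(fio_json)
    pct = data["jobs"][job][direction]["clat_ns"]["percentile"]
    return pct["99.000000"] / 1000.0   # ns -> us

# Minimal fabricated document containing only the fields this helper touches:
doc = json.dumps({"jobs": [{"read": {"clat_ns": {"percentile": {"99.000000": 250000}}}}]})
print(p99_latency_us(doc))  # 250.0
```

Feed the helper the JSON from a baseline run and an IOThread run and compare the two numbers directly.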

This article describes a very interesting problem. Could you give more details on two parts that I did not quite understand from the current info:
1. How was the conclusion drawn from the `dmesg` logs?
2. How does the `x-data-plane` feature solve this problem? Some links to a good introduction to x-data-plane would be helpful.
Q: How was the conclusion drawn from the `dmesg` logs?
A: In the normal case, the Linux kernel context-switches a thread because it has exhausted its ideal timeslice; the switch is driven by the periodic scheduler tick, so the resulting stack contains scheduler frames but no futex frames.
The abnormal case (as shown in the article above) looks like the following; here the context switch is triggered by lock contention:
[] dump_stack+0x64/0x84
[] __schedule+0x561/0x900
[] schedule+0x29/0x70
[] futex_wait_queue_me+0xd8/0x150
[] futex_wait+0x1ab/0x2b0
[] ? futex_wake+0x80/0x160
[] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
[] do_futex+0xf5/0xd20
[] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
[] ? __dequeue_entity+0x30/0x50
[] ? do_vfs_ioctl+0x93/0x520
[] ? native_write_msr_safe+0xa/0x10
[] SyS_futex+0x7d/0x170
[] ? fire_user_return_notifiers+0x42/0x50
[] ? do_notify_resume+0xc5/0x100
[] system_call_fastpath+0x1a/0x1f
Q: How does the `x-data-plane` feature solve this problem? Some links to a good introduction to x-data-plane would be helpful.
A: Without the "x-data-plane" feature, the I/O worker thread in QEMU and the QEMU vCPU threads share one global QEMU lock for synchronization in the QEMU main loop, which causes lock contention. With "x-data-plane", QEMU creates a dedicated I/O thread outside the main loop that handles I/O requests without needing to hold the global QEMU lock.
Good introduction links:
[0] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
[1] http://events.linuxfoundation.org/sites/events/files/slides/CloudOpen2013_Khoa_Huynh_v3.pdf
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.0_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.0_Release_Notes-Virtualization.html
[3] https://blueprints.launchpad.net/nova/+spec/add-virtio-data-plane-support-for-qemu
I will update this article to add more details later on.
Thanks for the replies!
[] futex_wait_queue_me+0xd8/0x150
[] futex_wait+0x1ab/0x2b0
seem to be the key sign of the problem.
The Data-Plane performance from the IBM report is quite amazing: 1.58 million IOPS for a single VM in 2013.
Right, those are the key signs of this problem. The performance of I/O-intensive workloads really does improve. For me, with x-data-plane enabled in QEMU, the vCPU thread's timeslice becomes almost stable.