x-data-plane feature in QEMU/KVM

Abstract
Systems sometimes use a single global lock to synchronize different threads. QEMU/KVM (http://wiki.qemu.org/Main_Page) does this too. However, a global lock can cause lock contention, which degrades the performance and scalability of the whole system. The x-data-plane feature was designed and implemented to address this problem in QEMU/KVM. The high-level idea is that I/O requests are handled by a dedicated IOThread rather than by the QEMU main loop threads, so the I/O thread does not contend for the lock with the other QEMU main loop threads.

How to find lock contention problem in QEMU/KVM
I tested with the following environment settings.

I found that the timeslice of a vCPU thread in QEMU/KVM is unstable when
the guest OS issues many read requests (for example, reading 4KB at a
time, 8GB in total, from one file). This appears to be caused by lock
contention in the QEMU layer. I observed the problem under the following
workload.

Workload settings:
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4, and pCPU5. Two
KVM virtual machines (VM1 and VM2) run on the VMM. Each VM has 5
virtual CPUs (vCPU0, vCPU1, vCPU2, vCPU3, vCPU4). vCPU0 in VM1 and
vCPU0 in VM2 are pinned to pCPU0 and pCPU5 respectively, and are
dedicated to handling interrupts. vCPU1 in VM1 and vCPU1 in VM2 are
pinned to pCPU1; vCPU2 in VM1 and vCPU2 in VM2 are pinned to pCPU2;
vCPU3 in VM1 and vCPU3 in VM2 are pinned to pCPU3; vCPU4 in VM1 and
vCPU4 in VM2 are pinned to pCPU4. Except for vCPU0 in VM2 (pinned to
pCPU5), every vCPU in VM1 and VM2 runs one CPU-intensive thread
(while(1){i++}) so that the vCPU never goes idle. In VM1, I start one
I/O thread on vCPU2, which reads 4KB at a time from one file (8GB in
total). The I/O scheduler in VM1 and VM2 is NOOP. The I/O scheduler in
the VMM is CFQ. I also pinned the I/O worker threads launched by QEMU
to pCPU5 (note: no CPU-intensive thread runs on pCPU5, so the QEMU I/O
worker threads handle I/O requests as soon as possible). The process
scheduling class in the VMs and the VMM is CFS. The I/O bus for VM1
and VM2 should be SCSI, IDE, or VIRTIO.
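
The guest-side read workload described above can be sketched roughly as follows. This is a minimal Python illustration, not the program used in the test: the function name `read_in_chunks` and its parameters are hypothetical. It also reads through the guest page cache, so a real test would probably need a file larger than guest memory, or direct I/O, for the reads to actually reach the virtual disk.

```python
import os

CHUNK_SIZE = 4 * 1024  # 4KB per read, as in the workload above


def read_in_chunks(path, chunk_size=CHUNK_SIZE, cpu=None):
    """Read `path` sequentially, `chunk_size` bytes per read() call.

    If `cpu` is given, pin the caller to that CPU first (Linux only),
    mirroring "one I/O thread on vCPU2" inside the guest.
    Returns (number of read calls that returned data, total bytes read).
    """
    if cpu is not None:
        os.sched_setaffinity(0, {cpu})  # pid 0 means the current process

    calls = 0
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:  # an empty bytes object signals end of file
                break
            calls += 1
            total += len(data)
    finally:
        os.close(fd)
    return calls, total
```

Inside the guest, a call such as `read_in_chunks("/path/to/testfile", cpu=2)` would approximate the I/O thread on vCPU2; the path is a placeholder.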

Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0

When I ran the above workload, the timeslice of the vCPU2 thread
jittered heavily. I suspect this is triggered by lock contention in the
QEMU layer, because the debug log I added in front of the VMM Linux
kernel's schedule->__schedule->context_switch looks like the following.
Whenever the timeslice jitters heavily, this debug information appears.

Dec 13 11:22:33 mobius04 kernel: [39163.015789] Call Trace:
Dec 13 11:22:33 mobius04 kernel: [39163.015791] [<ffffffff8176b2f0>] dump_stack+0x64/0x84
Dec 13 11:22:33 mobius04 kernel: [39163.015793] [<ffffffff8176bf85>] __schedule+0x5b5/0x960
Dec 13 11:22:33 mobius04 kernel: [39163.015794] [<ffffffff8176c409>] schedule+0x29/0x70
Dec 13 11:22:33 mobius04 kernel: [39163.015796] [<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
Dec 13 11:22:33 mobius04 kernel: [39163.015798] [<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
Dec 13 11:22:33 mobius04 kernel: [39163.015800] [<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
Dec 13 11:22:33 mobius04 kernel: [39163.015804] [<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
Dec 13 11:22:33 mobius04 kernel: [39163.015805] [<ffffffff810f1275>] do_futex+0xf5/0xd20
Dec 13 11:22:33 mobius04 kernel: [39163.015813] [<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
Dec 13 11:22:33 mobius04 kernel: [39163.015816] [<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
Dec 13 11:22:33 mobius04 kernel: [39163.015818] [<ffffffff81013d06>] ? __switch_to+0x596/0x690
Dec 13 11:22:33 mobius04 kernel: [39163.015820] [<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
Dec 13 11:22:33 mobius04 kernel: [39163.015822] [<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
Dec 13 11:22:33 mobius04 kernel: [39163.015824] [<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
Dec 13 11:22:33 mobius04 kernel: [39163.015826] [<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
Dec 13 11:22:33 mobius04 kernel: [39163.015828] [<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f

How to solve lock contention problem in QEMU/KVM

Enable the x-data-plane feature in QEMU. Older QEMU versions appear to lack this feature; I tested with v2.2.0. My libvirt XML configuration is as follows.

<domain type='kvm' id='2' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>kvm1</name>
  <uuid>8e9c4603-c4b5-fa41-b251-1dc4ffe1872c</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>5</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/home/images/kvm1.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <disk type='block' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='scsi' index='0'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:01:ab:ca'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/11'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/11'>
      <source path='/dev/pts/11'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none'/>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.config-wce=off'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
</domain>
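
The three `-set` pairs in the `<qemu:commandline>` block at the end of this configuration all follow one pattern: `device.<alias>.<property>`, where the alias matches the `<alias name=.../>` of the virtio disk. As a rough sketch, the block could be generated for a given disk alias with Python's standard `xml.etree.ElementTree`. The helper name `data_plane_commandline` is hypothetical and not part of libvirt or QEMU; the property list is copied from the configuration above.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:qemu attribute of the <domain> element.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def data_plane_commandline(alias):
    """Build a <qemu:commandline> element for one virtio disk alias."""
    ET.register_namespace("qemu", QEMU_NS)
    cmdline = ET.Element(f"{{{QEMU_NS}}}commandline")
    for prop in ("scsi=off", "config-wce=off", "x-data-plane=on"):
        # Each property becomes a "-set" / "device.<alias>.<prop>" pair.
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", value="-set")
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg",
                      value=f"device.{alias}.{prop}")
    return cmdline
```

Calling `ET.tostring(data_plane_commandline("virtio-disk0"), encoding="unicode")` serializes the element so it can be compared against the hand-written block.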

Conclusion
Lock contention causes poor system scalability and should be eliminated from the system.

References
1. http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
2. https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation (see also <cputune><iothreadpin> and <driver iothread=>)

5 comments:

  1. This article describes a very interesting problem. Could you give more details on two parts that I did not fully understand from the current information:

    1. How was the conclusion drawn from the `dmesg` logs?

    2. How does the `x-data-plane` function solve this problem? Some links to a good introduction to x-data-plane would be helpful.

  2. Q: How was the conclusion drawn from the `dmesg` logs?

    A: A normal context switch in the Linux kernel, triggered when the thread has exhausted its ideal timeslice, looks like the following:


    [] dump_stack+0x64/0x84
    [] __schedule+0x561/0x900
    [] _cond_resched+0x2a/0x40
    [] kvm_arch_vcpu_ioctl_run+0xf54/0x12d0 [kvm]
    [] ? futex_wake+0x80/0x160
    [] ? kvm_arch_vcpu_load+0x4e/0x1b0 [kvm]
    [] kvm_vcpu_ioctl+0x3f7/0x560 [kvm]
    [] ? __dequeue_entity+0x30/0x50
    [] ? __switch_to+0x596/0x690
    [] do_vfs_ioctl+0x93/0x520
    [] ? SyS_futex+0x7d/0x170
    [] SyS_ioctl+0xa1/0xb0
    [] system_call_fastpath+0x1a/0x1f

    The abnormal case, triggered by lock contention, looks like the following (as already shown in the article above).

    [] dump_stack+0x64/0x84
    [] __schedule+0x561/0x900
    [] schedule+0x29/0x70
    [] futex_wait_queue_me+0xd8/0x150
    [] futex_wait+0x1ab/0x2b0
    [] ? futex_wake+0x80/0x160
    [] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
    [] do_futex+0xf5/0xd20
    [] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
    [] ? __dequeue_entity+0x30/0x50
    [] ? do_vfs_ioctl+0x93/0x520
    [] ? native_write_msr_safe+0xa/0x10
    [] SyS_futex+0x7d/0x170
    [] ? fire_user_return_notifiers+0x42/0x50
    [] ? do_notify_resume+0xc5/0x100
    [] system_call_fastpath+0x1a/0x1f
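
    The difference between the two traces can be checked mechanically. The following is a minimal sketch of a heuristic classifier, not a tool used in the original investigation: the function name `classify_trace` and the labels it returns are hypothetical. It treats a reliable `futex_wait` frame as a sign that the thread blocked on a userspace lock, and a reliable `_cond_resched` frame as an ordinary reschedule. Frames marked with `?` are ignored because the kernel flags them as unreliable.

```python
import re

# Matches "] func+0x..." or "] ? func+0x..." in a kernel call-trace line.
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x")

LOCK_WAIT_FRAMES = {"futex_wait", "futex_wait_queue_me"}


def classify_trace(lines):
    """Return "lock-wait", "timeslice", or "other" for one call trace."""
    reliable = set()
    for line in lines:
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:  # skip "?" (unreliable) frames
            reliable.add(match.group(2))
    if reliable & LOCK_WAIT_FRAMES:
        return "lock-wait"   # thread went to sleep inside a futex wait
    if "_cond_resched" in reliable:
        return "timeslice"   # voluntary reschedule from the vCPU run loop
    return "other"
```

    Feeding it the lines of each trace above should separate the two cases, although a futex wait can also come from condition variables and other primitives, so the result is only a hint.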

    Q: How does the `x-data-plane` function solve this problem? Some links to a good introduction to x-data-plane would be helpful.

    A: Without the “x-data-plane” feature, the QEMU I/O worker thread and the QEMU vCPU thread share one global QEMU lock for synchronization in the QEMU main loop, which causes lock contention. With “x-data-plane”, QEMU creates a dedicated I/O thread outside the QEMU main loop to handle I/O requests, and this thread does not need to hold the QEMU global lock to handle them.
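
    The effect can be illustrated with a toy model. The sketch below is not QEMU code and all names in it are hypothetical: one thread stands in for the I/O worker, the calling thread stands in for the vCPU thread, and `time.sleep` stands in for handling an I/O request. When both sides use the same lock, the "vCPU" side has to wait; when the I/O side has its own lock, it does not.

```python
import threading
import time


def vcpu_wait_time(io_lock, vcpu_lock, hold_s=0.05):
    """Return how long the "vCPU" side waits for `vcpu_lock` while an
    "I/O" thread holds `io_lock` for `hold_s` seconds."""
    io_has_lock = threading.Event()

    def io_worker():
        with io_lock:
            io_has_lock.set()   # signal that the I/O side now holds its lock
            time.sleep(hold_s)  # stand-in for handling one I/O request

    worker = threading.Thread(target=io_worker)
    worker.start()
    io_has_lock.wait()

    start = time.monotonic()
    with vcpu_lock:             # blocks only if the I/O thread holds this lock
        waited = time.monotonic() - start
    worker.join()
    return waited


global_lock = threading.Lock()
# Shared global lock: the vCPU side waits until the I/O side releases it.
contended = vcpu_wait_time(global_lock, global_lock)
# Dedicated I/O thread with its own lock: the vCPU side does not wait.
uncontended = vcpu_wait_time(threading.Lock(), global_lock)
print(f"shared lock: {contended:.3f}s, separate locks: {uncontended:.3f}s")
```

    The wait in the shared-lock case grows with the time the I/O side holds the lock, which is one way to picture the timeslice jitter of the vCPU thread.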

    Good introduction links:
    [0] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
    [1] http://events.linuxfoundation.org/sites/events/files/slides/CloudOpen2013_Khoa_Huynh_v3.pdf
    [2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.0_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.0_Release_Notes-Virtualization.html
    [3] https://blueprints.launchpad.net/nova/+spec/add-virtio-data-plane-support-for-qemu

  3. Thanks for the replies!

    [] futex_wait_queue_me+0xd8/0x150
    [] futex_wait+0x1ab/0x2b0

    seem to be the key signs of the problem.

    The data-plane performance in the IBM report is quite amazing: 1.58 million IOPS for a single VM in 2013.

    1. Right. They are the key signs of this problem. The performance of I/O-intensive workloads is indeed improved. For me, with x-data-plane in QEMU, the timeslice of the vCPU thread is almost stable.
