
I/O Microscopy: Tasks’ Disk I/O Information with High Accuracy

Abstract
Most popular task monitoring tools (such as top, iotop, and the proc interface) can report tasks’ disk I/O information, such as a task’s I/O utilization percentage, only about once per second, because of the kernel timer/tick frequency and the high time cost of the system interfaces. This article presents I/O Microscopy, a new way to get tasks’ disk I/O information with high accuracy. Experiments show that I/O Microscopy can filter out I/O-intensive tasks effectively.

Introduction
I/O-intensive threads in a workload (such as a table scan in a DBMS) are often the bottleneck of the whole system, and sometimes we need to examine them with high accuracy. However, most existing I/O profiling and monitoring systems observe tasks’ disk I/O information only once per second. They obtain their statistics through kernel interfaces, as described in “How does linux kernel collect task stats data”. The kernel updates these statistics on timer interrupts; in other words, timer interrupts trigger the events that update them. Therefore, these tools cannot be more accurate than the timer interrupt frequency allows. The challenge is either to trigger the kernel events that update these statistics at a higher frequency, or to find another way to collect tasks’ I/O statistics and calculate their I/O utilization percentage.

Tasks’ I/O Utilization Percentage
A task’s I/O utilization percentage is an important metric for deciding whether the task is I/O intensive. In general, it is defined as the fraction of a given period during which the task’s I/O requests are being handled. For example, if the requests of the task ‘dd if=/dev/zero of=./test …’ are handled by the disk drive for an accumulated time T1 during one second, the task’s I/O utilization percentage is (T1 / 1 second).
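
The definition above can be sketched in a few lines of Python. This is only an illustration, not code from any existing tool: the function name io_utilization and the (ts, tc) request tuples are hypothetical, and clipping requests to the window is an assumption made here so that only time inside the period is counted.

```python
def io_utilization(requests, window_start, window_end):
    """Return the fraction of [window_start, window_end) spent on I/O.

    requests: iterable of (ts, tc) pairs (start and completion time of one
    I/O request), in the same time unit as the window. Hypothetical input
    format, chosen for illustration only.
    """
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window must have positive length")

    busy = 0
    for ts, tc in requests:
        # Assumption: clip each request to the window so that only the
        # time spent inside the period is accumulated.
        start = max(ts, window_start)
        end = min(tc, window_end)
        if end > start:
            busy += end - start

    # Requests that overlap in time are summed separately, so the result
    # can exceed 1.0 when several requests are in flight at once.
    return busy / window


if __name__ == "__main__":
    # Two requests (20 ms and 10 ms, in nanoseconds) in a 100 ms window.
    sample = [(0, 20_000_000), (50_000_000, 60_000_000)]
    print(io_utilization(sample, 0, 100_000_000))
```

Because overlapping requests are simply added, a task that keeps several requests in flight can report more than 100% under this definition.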

The Design and Implementation of I/O Microscopy
We propose tracing each I/O request of a task from its start time (ts) to its completion time (tc). We monitor the task in this way for several milliseconds, say tt. The I/O utilization percentage of the task is then (accumulate(tc – ts) / tt). This approach requires instrumenting the kernel to measure the time but, fortunately, from kernel 4.1 these features are integrated by the BCC (https://github.com/iovisor/bcc) developers and we can use them directly. We still need to change the BCC source code to implement our solution. BCC already provides the accumulated time a task spends handling I/O in the block I/O layer; we only need to change how the result is calculated. I installed BCC from the v0.3.0 source code and changed the biotop tool as follows.


--- biotop.bakup	2017-08-19 10:16:04.826009960 -0400
+++ biotop	2017-08-19 10:09:27.635009960 -0400
@@ -77,7 +77,7 @@
 // the value of the output summary
 struct val_t {
     u64 bytes;
-    u64 us;
+    u64 ns; //changed by Weiwei Jia
     u32 io;
 };
 
@@ -122,7 +122,7 @@
 
     struct who_t *whop;
     struct val_t *valp, zero = {};
-    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
+    u64 delta_ns = bpf_ktime_get_ns() - *tsp;
 
     // setup info_t key
     struct info_t info = {};
@@ -154,7 +154,8 @@
     }
 
     // save stats
-    valp->us += delta_us;
+    //bpf_trace_printk("%lu\\n", valp->us);
+    valp->ns += delta_ns;
     valp->bytes += req->__data_len;
     valp->io++;
 
@@ -183,19 +184,20 @@
 exiting = 0
 while 1:
     try:
-        sleep(10.0/1000.0)
+        sleep(100.0/1000.0)
     except KeyboardInterrupt:
         exiting = 1
 
     # header
-    if clear:
-        call("clear")
-    else:
-        print()
-    with open(loadavg) as stats:
-        print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
-    print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
-        "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+    #if clear:
+    #    call("clear")
+    #else:
+    #    print()
+    #with open(loadavg) as stats:
+        #print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
+    #print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
+    #    "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+    #print("%-6s %-16s %6s" % ("PID", "COMM", "IO"))
 
     # by-PID output
     counts = b.get_table("counts")
@@ -211,10 +213,15 @@
             diskname = "?"
 
         # print line
-        avg_ms = (float(v.us) / 1000) / v.io
-        print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f %d" % (k.pid, k.name,
-            "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
-            v.bytes / 1024, avg_ms, v.us))
+        #avg_ms = (float(v.us) / 1000) / v.io
+        #print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f" % (k.pid, k.name,
+        #    "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
+        #    v.bytes / 1024, avg_ms))
+        io_percent = ((float(v.ns) / 1000.0)/100000.0)
+        if io_percent > 0.0:
+            print("%-6d %-16s %6.5f %d" % (k.pid, k.name, io_percent, v.ns))
+            v.ns = 0
+            io_percent = 0.0
 
         line += 1
         if line >= maxrows:
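
The unit conversion in the patch above can be checked with a short Python sketch. The patch divides the accumulated nanoseconds by 1000 and then by 100000, which is a division by 10^8 ns, that is, by the 100 ms sampling interval set in the sleep() call. The helper utilization_from_ns below is hypothetical and not part of BCC; it only shows how the same calculation could follow the interval if the sleep time were changed.

```python
import math

NS_PER_SECOND = 1_000_000_000


def utilization_from_ns(busy_ns, interval_s):
    """Convert accumulated busy time (ns) into a fraction of the interval.

    Hypothetical helper for illustration; interval_s is the sampling
    interval in seconds (0.1 for the 100 ms sleep used in the patch).
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return busy_ns / (interval_s * NS_PER_SECOND)


if __name__ == "__main__":
    busy_ns = 25_000_000  # example value: 25 ms of accumulated I/O time

    # Expression used in the patch: hard-coded for a 100 ms interval.
    patch_value = (float(busy_ns) / 1000.0) / 100000.0

    # Same quantity with the interval passed explicitly.
    general_value = utilization_from_ns(busy_ns, 0.1)

    print(patch_value, general_value)
    print(math.isclose(patch_value, general_value))
```

Passing the interval explicitly keeps the reported fraction consistent if the sampling period is later shortened or lengthened, whereas the constant in the patch is only correct for a 100 ms sleep.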

I also found a bug in how the kernel calculates time; it has been fixed here (https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=27727df240c7cc84f2ba6047c6f18d5addfd25ef). The issue I reported is here (https://github.com/iovisor/bcc/issues/1302).

Conclusion
This article proposes a new way to calculate tasks’ I/O utilization percentage with higher flexibility and accuracy than the existing solutions in widely used systems.

References
[1] https://github.com/iovisor/bcc
[2] https://iovisor.github.io/bcc/
[3] http://www.brendangregg.com/
[4] https://www.usenix.org/conference/atc17/program/presentation/gregg-superpowers
[5] https://www.usenix.org/conference/atc17/program/presentation/gregg-flame
[6] https://www.kernel.org/doc/Documentation/accounting/taskstats-struct.txt
[7] Hollingsworth, Jeffrey K., Barton Paul Miller, and Jon Cargille. “Dynamic program instrumentation for scalable performance tools.” Scalable High-Performance Computing Conference, 1994., Proceedings of the. IEEE, 1994.
[8] https://github.com/iovisor/bcc/blob/master/INSTALL.md
