I/O Microscopy: Tasks’ Disk I/O Information with High Accuracy

Abstract
Most popular task monitoring tools (such as top, iotop, and the proc filesystem) can only report tasks’ disk I/O information, such as a task’s I/O utilization percentage, about once per second, because of the kernel timer (tick) frequency and the high time cost of the system interfaces. This article presents I/O Microscopy, a way to obtain tasks’ disk I/O information at a finer time granularity and with high accuracy. Experiments show that I/O Microscopy can identify I/O-intensive tasks effectively.

Introduction
I/O-intensive threads in a workload (such as a table scan in a DBMS) are often the bottleneck of the whole system, and sometimes we need to examine them with high accuracy. However, most existing I/O profiling and monitoring tools sample tasks’ disk I/O information only once per second. They obtain their statistics via kernel interfaces and, as indicated in "How does linux kernel collect task stats data", the kernel updates these statistics from the timer interrupt. In other words, the timer interrupt triggers the events that update them, so these tools cannot be more accurate than the timer interrupt frequency allows. One challenge is how to trigger the kernel events that update these statistics at a higher frequency. Alternatively, we need another way to obtain tasks’ I/O statistics from which to calculate their I/O utilization percentage.

Tasks’ I/O Utilization Percentage
A task’s I/O utilization percentage is an important metric for deciding whether the task is I/O intensive. In general, it is defined as the share of a given period during which the task’s I/O requests are being handled. For example, if the I/O requests of the task ‘dd if=/dev/zero of=./test …’ are handled by the disk drive for an accumulated time T1 during one second, the task’s I/O utilization percentage is (T1 / 1 second).
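
As a minimal sketch of this definition, the Python snippet below computes the utilization from a list of (start, completion) timestamps. The function name io_utilization and the sample numbers are assumptions made for illustration; they are not part of any existing tool.

```python
MS = 1000000  # nanoseconds per millisecond


def io_utilization(requests, window_ns):
    """Return accumulated I/O time divided by the observation window.

    requests: iterable of (start_ns, completion_ns) pairs (hypothetical data).
    window_ns: length of the observation window in nanoseconds.
    """
    if window_ns <= 0:
        raise ValueError("window_ns must be positive")
    # Accumulate the time each request spent between start and completion.
    busy_ns = sum(tc - ts for ts, tc in requests)
    return float(busy_ns) / window_ns


# Three made-up requests observed within a one-second window.
sample = [(0, 200 * MS), (300 * MS, 450 * MS), (600 * MS, 750 * MS)]
utilization = io_utilization(sample, 1000 * MS)
print("%.2f" % utilization)
```

The result is a fraction; multiply it by 100 to express it as a percentage. Because the times are simply summed, a task with several requests in flight at once can yield a value above 1.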

The Design and Implementation of I/O Microscopy
We propose tracing each I/O request of a task from its start time (ts) to its completion time (tc). We monitor the task in this way for a period of several milliseconds, say tt. The I/O utilization percentage of the task is then (accumulate(tc – ts) / tt). This approach requires instrumenting the kernel to measure the time. Fortunately, since kernel 4.1 the necessary features are available through BCC (https://github.com/iovisor/bcc) and we can use them directly. We still need to change the BCC source code to implement our solution: BCC already provides the accumulated time a task spends handling I/O in the block I/O layer, so we only need to change how the result is calculated. I installed BCC v0.3.0 from source and changed the biotop source code as follows.


--- biotop.bakup	2017-08-19 10:16:04.826009960 -0400
+++ biotop	2017-08-19 10:09:27.635009960 -0400
@@ -77,7 +77,7 @@
 // the value of the output summary
 struct val_t {
     u64 bytes;
-    u64 us;
+    u64 ns; //changed by Weiwei Jia
     u32 io;
 };
 
@@ -122,7 +122,7 @@
 
     struct who_t *whop;
     struct val_t *valp, zero = {};
-    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
+    u64 delta_ns = bpf_ktime_get_ns() - *tsp;
 
     // setup info_t key
     struct info_t info = {};
@@ -154,7 +154,8 @@
     }
 
     // save stats
-    valp->us += delta_us;
+    //bpf_trace_printk("%lu\\n", valp->us);
+    valp->ns += delta_ns;
     valp->bytes += req->__data_len;
     valp->io++;
 
@@ -183,19 +184,20 @@
 exiting = 0
 while 1:
     try:
-        sleep(10.0/1000.0)
+        sleep(100.0/1000.0)
     except KeyboardInterrupt:
         exiting = 1
 
     # header
-    if clear:
-        call("clear")
-    else:
-        print()
-    with open(loadavg) as stats:
-        print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
-    print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
-        "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+    #if clear:
+    #    call("clear")
+    #else:
+    #    print()
+    #with open(loadavg) as stats:
+        #print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
+    #print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
+    #    "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+    #print("%-6s %-16s %6s" % ("PID", "COMM", "IO"))
 
     # by-PID output
     counts = b.get_table("counts")
@@ -211,10 +213,15 @@
             diskname = "?"
 
         # print line
-        avg_ms = (float(v.us) / 1000) / v.io
-        print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f %d" % (k.pid, k.name,
-            "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
-            v.bytes / 1024, avg_ms, v.us))
+        #avg_ms = (float(v.us) / 1000) / v.io
+        #print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f" % (k.pid, k.name,
+        #    "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
+        #    v.bytes / 1024, avg_ms))
+        io_percent = ((float(v.ns) / 1000.0)/100000.0)
+        if io_percent > 0.0:
+            print("%-6d %-16s %6.5f %d" % (k.pid, k.name, io_percent, v.ns))
+            v.ns = 0
+            io_percent = 0.0
 
         line += 1
         if line >= maxrows:
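
The patched script divides the accumulated nanoseconds by 1000.0 and then by 100000.0, which corresponds to the 100 ms sleep interval expressed in microseconds. As a hedged sketch of the same arithmetic, the Python snippet below derives the divisor from the interval so that the two cannot drift apart. The names INTERVAL_S and io_fraction, and the sample value, are assumptions for illustration and do not appear in biotop.

```python
INTERVAL_S = 0.1  # assumed to match sleep(100.0/1000.0) in the patched loop


def io_fraction(accumulated_ns, interval_s=INTERVAL_S):
    """Return accumulated I/O time as a fraction of the sampling interval."""
    interval_ns = interval_s * 1e9  # seconds -> nanoseconds
    return accumulated_ns / interval_ns


# Hypothetical task that accumulated 25 ms of I/O time in one interval.
print("%.5f" % io_fraction(25000000))
```

With this form, changing the sleep interval only requires changing INTERVAL_S.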

I also found a bug in how the kernel calculates time, which has been fixed here (https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=27727df240c7cc84f2ba6047c6f18d5addfd25ef). The issue I reported is here (https://github.com/iovisor/bcc/issues/1302).

Conclusion
This article proposes a new way to calculate tasks’ I/O utilization percentage that is more flexible and more accurate than the existing solutions in widely used tools.

References
[1] https://github.com/iovisor/bcc
[2] https://iovisor.github.io/bcc/
[3] http://www.brendangregg.com/
[4] https://www.usenix.org/conference/atc17/program/presentation/gregg-superpowers
[5] https://www.usenix.org/conference/atc17/program/presentation/gregg-flame
[6] https://www.kernel.org/doc/Documentation/accounting/taskstats-struct.txt
[7] Hollingsworth, Jeffrey K., Barton Paul Miller, and Jon Cargille. “Dynamic program instrumentation for scalable performance tools.” Proceedings of the Scalable High-Performance Computing Conference. IEEE, 1994.
[8] https://github.com/iovisor/bcc/blob/master/INSTALL.md

Weiwei Jia

Weiwei Jia has been a Ph.D. student in the Department of Computer Science at New Jersey Institute of Technology since 2016. His research interests include storage systems, operating systems and computer systems.
