I/O Microscopy: Tasks’ Disk I/O Information with High Accuracy
Posted on In Systems, Virtualization
Abstract
Most popular task monitor systems (such as top, iotop, proc, etc) can only get tasks’ disk I/O information like tasks’ I/O utilization percentage every seconds due to kernel timer/tick frequency and high time cost of system interfaces. This article presents I/O Microscopy, a new way to get tasks’ disk I/O information with high accuracy. Experiments show that I/O microscopy can filter out I/O intensive tasks effectively. Introduction
As is known, I/O intensive threads in one workload (like table scan of DBMS) are the bottleneck of the whole systems. Sometimes, we need to look into them with high accuracy. However, most current existing I/O profiling/monitor systems are observing tasks’ disk I/O information every seconds. In fact, they get statistics via kernel interfaces, and as indicted in How does linux kernel collect task stats data. One has to understand that kernel is updating these parameters via timer interrupt, and in other words, timer interrupt in kernel triggers events to update these statistics. Therefore, these tools cannot be more accuracy than timer interrupts frequency. One challenge here is that how we can trigger kernel events to update these parameters with high frequency. Or we might need to find other ways to get tasks’ I/O statistics to calculate tasks’ I/O utilization percentage. Tasks’ I/O Utilization Percentage
Tasks’ I/O utilization percentage is one important metric to check whether this task is I/O intensive or not. In general, it is defined like this: the time cost of one task to handle I/O requests during one specific period. For example, the accumulate time cost, say T1, of task ‘dd if=/dev/zero of=./test …’ to be handled by disk drive during one second, and the task’s I/O utilization percentage equals to (T1 / 1 second). The Design and Implementation of I/O Microscopy
We propose the way to trace from the start time (ts) of one task’s I/O request to the completion time (tc) of this task. We monitor this procedure for this task for seceral milliseconds, say tt. At last, the I/O utilization percentage of the task will be (accumulate(tc – ts) / tt). For this way, we have to instrument kernel to count the time but fortunately, from kernel 4.1, these features are integrated by BCC (https://github.com/iovisor/bcc) developers and we can use them directly. We still need to change the source codes of BCC to implement our solution. In fact, in BCC, the accumulate delta time cost of one task to handle I/O in block I/O layer is available, we just need to change the calculation way it does. I install BCC from source codes of v0.3.0, and I change the source codes as follows.
This article proposes a new way to calculate tasks’ I/O utilization percentage with high flexibility and accuracy compared with existing solutions in current famous systems. References
[1] https://github.com/iovisor/bcc
[2] https://iovisor.github.io/bcc/
[3] http://www.brendangregg.com/
[4] https://www.usenix.org/conference/atc17/program/presentation/gregg-superpowers
[5] https://www.usenix.org/conference/atc17/program/presentation/gregg-flame
[6] https://www.kernel.org/doc/Documentation/accounting/taskstats-struct.txt
[7] Hollingsworth, Jeffrey K., Barton Paul Miller, and Jon Cargille. “Dynamic program instrumentation for scalable performance tools.” Scalable High-Performance Computing Conference, 1994., Proceedings of the. IEEE, 1994.
[8] https://github.com/iovisor/bcc/blob/master/INSTALL.md
Most popular task monitor systems (such as top, iotop, proc, etc) can only get tasks’ disk I/O information like tasks’ I/O utilization percentage every seconds due to kernel timer/tick frequency and high time cost of system interfaces. This article presents I/O Microscopy, a new way to get tasks’ disk I/O information with high accuracy. Experiments show that I/O microscopy can filter out I/O intensive tasks effectively. Introduction
As is known, I/O intensive threads in one workload (like table scan of DBMS) are the bottleneck of the whole systems. Sometimes, we need to look into them with high accuracy. However, most current existing I/O profiling/monitor systems are observing tasks’ disk I/O information every seconds. In fact, they get statistics via kernel interfaces, and as indicted in How does linux kernel collect task stats data. One has to understand that kernel is updating these parameters via timer interrupt, and in other words, timer interrupt in kernel triggers events to update these statistics. Therefore, these tools cannot be more accuracy than timer interrupts frequency. One challenge here is that how we can trigger kernel events to update these parameters with high frequency. Or we might need to find other ways to get tasks’ I/O statistics to calculate tasks’ I/O utilization percentage. Tasks’ I/O Utilization Percentage
Tasks’ I/O utilization percentage is one important metric to check whether this task is I/O intensive or not. In general, it is defined like this: the time cost of one task to handle I/O requests during one specific period. For example, the accumulate time cost, say T1, of task ‘dd if=/dev/zero of=./test …’ to be handled by disk drive during one second, and the task’s I/O utilization percentage equals to (T1 / 1 second). The Design and Implementation of I/O Microscopy
We propose the way to trace from the start time (ts) of one task’s I/O request to the completion time (tc) of this task. We monitor this procedure for this task for seceral milliseconds, say tt. At last, the I/O utilization percentage of the task will be (accumulate(tc – ts) / tt). For this way, we have to instrument kernel to count the time but fortunately, from kernel 4.1, these features are integrated by BCC (https://github.com/iovisor/bcc) developers and we can use them directly. We still need to change the source codes of BCC to implement our solution. In fact, in BCC, the accumulate delta time cost of one task to handle I/O in block I/O layer is available, we just need to change the calculation way it does. I install BCC from source codes of v0.3.0, and I change the source codes as follows.
--- biotop.bakup 2017-08-19 10:16:04.826009960 -0400
+++ biotop 2017-08-19 10:09:27.635009960 -0400
@@ -77,7 +77,7 @@
// the value of the output summary
struct val_t {
u64 bytes;
- u64 us;
+ u64 ns; //changed by Weiwei Jia
u32 io;
};
@@ -122,7 +122,7 @@
struct who_t *whop;
struct val_t *valp, zero = {};
- u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
+ u64 delta_ns = bpf_ktime_get_ns() - *tsp;
// setup info_t key
struct info_t info = {};
@@ -154,7 +154,8 @@
}
// save stats
- valp->us += delta_us;
+ //bpf_trace_printk("%lu\\n", valp->us);
+ valp->ns += delta_ns;
valp->bytes += req->__data_len;
valp->io++;
@@ -183,19 +184,20 @@
exiting = 0
while 1:
try:
- sleep(10.0/1000.0)
+ sleep(100.0/1000.0)
except KeyboardInterrupt:
exiting = 1
# header
- if clear:
- call("clear")
- else:
- print()
- with open(loadavg) as stats:
- print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
- print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
- "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+ #if clear:
+ # call("clear")
+ #else:
+ # print()
+ #with open(loadavg) as stats:
+ #print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
+ #print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
+ # "D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
+ #print("%-6s %-16s %6s" % ("PID", "COMM", "IO"))
# by-PID output
counts = b.get_table("counts")
@@ -211,10 +213,15 @@
diskname = "?"
# print line
- avg_ms = (float(v.us) / 1000) / v.io
- print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f %d" % (k.pid, k.name,
- "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
- v.bytes / 1024, avg_ms, v.us))
+ #avg_ms = (float(v.us) / 1000) / v.io
+ #print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f" % (k.pid, k.name,
+ # "W" if k.rwflag else "R", k.major, k.minor, diskname, v.io,
+ # v.bytes / 1024, avg_ms))
+ io_percent = ((float(v.ns) / 1000.0)/100000.0)
+ if io_percent > 0.0:
+ print("%-6d %-16s %6.5f %d" % (k.pid, k.name, io_percent, v.ns))
+ v.ns = 0
+ io_percent = 0.0
line += 1
if line >= maxrows:
I still find one bug of kernel to calculate time and it has been fixed here (https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=27727df240c7cc84f2ba6047c6f18d5addfd25ef). My proposed issue was here (https://github.com/iovisor/bcc/issues/1302).
ConclutionThis article proposes a new way to calculate tasks’ I/O utilization percentage with high flexibility and accuracy compared with existing solutions in current famous systems. References
[1] https://github.com/iovisor/bcc
[2] https://iovisor.github.io/bcc/
[3] http://www.brendangregg.com/
[4] https://www.usenix.org/conference/atc17/program/presentation/gregg-superpowers
[5] https://www.usenix.org/conference/atc17/program/presentation/gregg-flame
[6] https://www.kernel.org/doc/Documentation/accounting/taskstats-struct.txt
[7] Hollingsworth, Jeffrey K., Barton Paul Miller, and Jon Cargille. “Dynamic program instrumentation for scalable performance tools.” Scalable High-Performance Computing Conference, 1994., Proceedings of the. IEEE, 1994.
[8] https://github.com/iovisor/bcc/blob/master/INSTALL.md