How to Measure Time Accurately in Programs

It is common to measure time in programs with APIs such as clock() and gettimeofday(). Sometimes we also need to measure time “accurately”, for example when timing a small piece of code for performance analysis or keeping time in time-sensitive game software. Measuring time with perfect accuracy is hard, but we can usually measure it at a granularity that is acceptable for our purpose. Let’s look at the possible methods.

gettimeofday and clock_gettime

gettimeofday and clock_gettime are POSIX APIs to get the time. gettimeofday is easy to use, but it offers no way to find out the resolution of the system clock. For clock_gettime, clock_getres can be used to find out the resolution of a given clock.
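As a minimal sketch of the clock_getres call mentioned above, the following C code queries the resolution a clock reports. The helper names clock_resolution_ns and print_clock_resolution are made up for this illustration and are not part of POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Return the resolution reported for a clock in nanoseconds, or -1 on error.
 * (Hypothetical helper name, used only for this example.) */
long long clock_resolution_ns(clockid_t id)
{
    struct timespec res;

    if (clock_getres(id, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Print the resolution of one clock; `name` is only a label for the output. */
void print_clock_resolution(const char *name, clockid_t id)
{
    long long ns = clock_resolution_ns(id);

    if (ns < 0)
        printf("%s: clock_getres failed\n", name);
    else
        printf("%s: %lld ns\n", name, ns);
}
```

Calling print_clock_resolution("CLOCK_MONOTONIC", CLOCK_MONOTONIC) from main prints the value for that clock. The number depends on the system, and on older glibc versions the program may need to be linked with -lrt.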

On the other hand, calling gettimeofday and clock_gettime has a cost of its own. Assuming they read the time from the same source, one important factor for accuracy is how long the call itself takes. How expensive are these APIs? Is gettimeofday very slow?

A benchmark by David Terei gives a rough picture. I quote part of the results here, including time and ftime for comparison even though they only provide a granularity of seconds and milliseconds. Each line shows the API, the unit of the value it returns, and the measured cost of one call:

time (s) => 4ns
ftime (ms) => 39ns
gettimeofday (us) => 30ns
clock_gettime (ns) => 26ns (CLOCK_REALTIME)
clock_gettime (ns) => 8ns (CLOCK_REALTIME_COARSE)
clock_gettime (ns) => 26ns (CLOCK_MONOTONIC)
clock_gettime (ns) => 9ns (CLOCK_MONOTONIC_COARSE)
clock_gettime (ns) => 170ns (CLOCK_PROCESS_CPUTIME_ID)
clock_gettime (ns) => 154ns (CLOCK_THREAD_CPUTIME_ID)

The cost of gettimeofday is in the tens of nanoseconds. This cost, and the fact that the actual resolution is unknown, may be acceptable for many programs. On modern Linux these APIs are implemented through the vDSO and avoid calling into the kernel (see a discussion here). If a lower cost (around 10ns) and a known resolution are required by the program, clock_gettime with CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE may be a good choice. Keep in mind that the coarse clocks trade resolution for speed; clock_getres reports how coarse they are.

For even higher resolution, rdtsc may be put on the table.
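To get a feeling for these per-call costs on your own machine, a rough micro-benchmark along the following lines can be used. This is only a sketch and not the benchmark quoted above: the function names are hypothetical, the loop overhead is included in the result, and the numbers will vary with the CPU, kernel and C library.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Nanoseconds elapsed from a to b. */
static long long timespec_diff_ns(struct timespec a, struct timespec b)
{
    return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
         + (b.tv_nsec - a.tv_nsec);
}

/* Estimate the average cost, in nanoseconds, of one clock_gettime(id, ...)
 * call by timing `iterations` back-to-back calls with CLOCK_MONOTONIC.
 * Returns a negative value on error. (Hypothetical helper for this sketch.) */
double clock_gettime_cost_ns(clockid_t id, long iterations)
{
    struct timespec start, end, scratch;
    long i;

    /* Reject bad arguments and clocks the system does not support. */
    if (iterations <= 0 || clock_gettime(id, &scratch) != 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    for (i = 0; i < iterations; i++)
        clock_gettime(id, &scratch);
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)timespec_diff_ns(start, end) / (double)iterations;
}
```

For example, clock_gettime_cost_ns(CLOCK_MONOTONIC, 1000000) gives an average for the regular monotonic clock. On Linux the same call with CLOCK_MONOTONIC_COARSE (guarded by #ifdef, since that clock is Linux-specific) lets you compare the two on your system.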

rdtsc and rdtscp

rdtsc is an instruction, supported since the Pentium class of CPUs, that reads the current time stamp counter (TSC), which is incremented every CPU tick (1/CPU_HZ). The TSC is a 64-bit register on x86 processors. PowerPC provides a similar capability. The TSC and rdtsc allow time to be measured at the granularity of CPU ticks.

There are a couple of good implementations using rdtsc in C/asm on the Web that you can check: Time-stamp counter, cycle.h and Pentium Time Stamp Counter.

Everything has two sides. You need to pay special attention to the drawbacks of rdtsc if you use it in your program.
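Since one tick lasts 1/CPU_HZ seconds, a tick count can be converted to time by dividing by the tick frequency. The sketch below shows that arithmetic only. The function names are made up for this example, and the frequency tsc_hz is assumed to be known and constant, which the later sections show is not always true.

```c
#include <stdint.h>

/* Ticks elapsed between two TSC readings. Unsigned subtraction keeps the
 * result correct even if the 64-bit counter wrapped once in between. */
uint64_t tsc_elapsed_ticks(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Convert a tick count to nanoseconds.
 * tsc_hz is the tick frequency in ticks per second; it is an input the
 * caller must supply and is assumed here to be constant. */
double tsc_ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1e9 / tsc_hz;
}
```

For instance, 2500 ticks at an assumed 2.5 GHz correspond to 1000 ns.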

First, rdtsc instructions may not be executed in the order in which they appear in the executable because of out-of-order execution. This can make an rdtsc execute earlier or later than expected and produce a misleading cycle count. Here is an example from Using the RDTSC Instruction for Performance Monitoring:

 rdtsc         ; read time stamp
 mov time, eax ; move counter into variable
 fdiv          ; floating-point divide
 rdtsc         ; read time stamp
 sub eax, time ; find the difference

This code tries to measure the time it takes to perform a floating-point division with fdiv. The fdiv takes a long time to complete and, potentially, the second rdtsc instruction could execute before the fdiv has finished. If this happens, the cycle count will not be the one expected.

Inserting a serializing instruction such as cpuid, which forces all preceding instructions in the code to complete before the program continues, keeps the rdtsc instructions from being executed out of order. The code using cpuid for the above example is as follows.

 cpuid         ; force all previous instructions to complete
 rdtsc         ; read time stamp counter
 mov time, eax ; move counter into variable
 fdiv          ; floating-point divide
 cpuid         ; wait for FDIV to complete before RDTSC
 rdtsc         ; read time stamp counter
 sub eax, time ; find the difference

An alternative is to use rdtscp, which waits until all previous instructions have been executed before reading the counter. However, rdtscp is not supported on all CPU models. Support is indicated by CPUID leaf 80000001H, EDX bit 27: if the bit is set to 1, rdtscp is present on the processor. For more details, check https://www.systutorials.com/x86-64-isa-assembly-references#x86-64-.28and-x86.29-isa-reference/.

rdtsc has other drawbacks as well. Here is a list of concerns combined from Game Timing and Multicore Processors and Time Stamp Counter, which together summarize the possible problems quite well.
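The CPUID check and the two ways of reading the counter can be sketched in C for GCC or Clang as follows. This is an illustration under assumptions, not a reference implementation: the function names are hypothetical, __get_cpuid comes from the compilers' <cpuid.h> header, and on non-x86 targets the functions are placeholder stubs that return 0 so that the file still compiles.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* __get_cpuid (GCC/Clang) */

/* Return 1 if CPUID leaf 80000001H reports RDTSCP (EDX bit 27), else 0. */
int cpu_has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;  /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Execute cpuid so that all earlier instructions complete first. */
static inline void serialize_with_cpuid(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
}

/* Read the counter at the start of a measured region: cpuid, then rdtsc. */
uint64_t tsc_begin(void)
{
    uint32_t lo, hi;

    serialize_with_cpuid();
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Read the counter at the end of a measured region. Uses rdtscp when the
 * caller says it is available, otherwise falls back to cpuid + rdtsc.
 * Note: rdtscp waits for earlier instructions, but later instructions may
 * still begin executing before the counter is read. */
uint64_t tsc_end(int have_rdtscp)
{
    uint32_t lo, hi, aux;

    if (have_rdtscp) {
        __asm__ __volatile__("rdtscp"
                             : "=a"(lo), "=d"(hi), "=c"(aux)
                             :
                             : "memory");
        (void)aux;
    } else {
        serialize_with_cpuid();
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    }
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Placeholder stubs: no TSC on this architecture. */
int cpu_has_rdtscp(void) { return 0; }
uint64_t tsc_begin(void) { return 0; }
uint64_t tsc_end(int have_rdtscp) { (void)have_rdtscp; return 0; }
#endif
```

A caller would typically evaluate cpu_has_rdtscp() once, store the result, and pass it to tsc_end() around the code being measured, because cpuid itself is an expensive instruction.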

Discontinuous values. Multiprocessor and dual-core systems do not guarantee synchronization of their cycle counters between cores. This is exacerbated when combined with modern power management technologies that idle and restore various cores at different times, which results in the cores typically being out of synchronization. For an application, this generally results in glitches or in potential crashes as the thread jumps between the processors and gets timing values that result in large deltas, negative deltas, or halted timing.

Variability of the CPU’s frequency. Technology that changes the frequency of the CPU is in use in many high-end desktop PCs. Recent Intel processors include a constant rate TSC. While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate.

Portability. Reliance on the time stamp counter also reduces portability, as other processors may not have a similar feature.
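When the resolution of clock_gettime is sufficient, one way to sidestep several of the concerns listed above is to read CLOCK_MONOTONIC instead of the TSC and to clamp deltas so that a reading which appears to go backwards never produces a negative interval. The following is a minimal sketch of that idea. The function names are invented for this example, and clamping hides a bad reading rather than fixing its cause.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds, or 0 if the call fails.
 * (Hypothetical helper for this sketch.) */
uint64_t monotonic_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Elapsed value between two readings of any counter, clamped to 0 when the
 * later reading is smaller than the earlier one (for example after a thread
 * moved to a core whose counter is behind). */
uint64_t clamped_delta(uint64_t earlier, uint64_t later)
{
    return later >= earlier ? later - earlier : 0;
}
```

clamped_delta works equally well on raw TSC readings, so it can be combined with rdtsc-based timing when the finer granularity is needed.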

Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.
