It is quite common to measure time in programs using APIs like gettimeofday(). We may also want to measure time “accurately” for certain purposes, such as measuring the execution time of a small piece of code for performance analysis, or keeping time in time-sensitive game software. Measuring time with perfect accuracy is hard, but we can certainly measure it at a granularity that is acceptable for our purpose. Let’s look at the possible methods.
gettimeofday and clock_gettime
gettimeofday and clock_gettime are POSIX APIs to get the time.
gettimeofday is easy to use, but it does not report the resolution of the system clock. For clock_gettime, clock_getres can be used to find out the resolution of a clock.
On the other hand, calling gettimeofday and clock_gettime has a cost of its own. Assuming they get the time from the same source, one important factor for accuracy is the cost (that is, the time) of calling these APIs. How expensive are these calls? Is gettimeofday very slow?
A benchmark and its results by David Terei give us a rough picture. I quote part of the results here, including time and ftime, although they only provide a granularity of seconds and milliseconds respectively:
    time (s)           => 4ns
    ftime (ms)         => 39ns
    gettimeofday (us)  => 30ns
    clock_gettime (ns) => 26ns (CLOCK_REALTIME)
    clock_gettime (ns) => 8ns (CLOCK_REALTIME_COARSE)
    clock_gettime (ns) => 26ns (CLOCK_MONOTONIC)
    clock_gettime (ns) => 9ns (CLOCK_MONOTONIC_COARSE)
    clock_gettime (ns) => 170ns (CLOCK_PROCESS_CPUTIME_ID)
    clock_gettime (ns) => 154ns (CLOCK_THREAD_CPUTIME_ID)
The cost of clock_gettime and gettimeofday is in the tens of nanoseconds. This cost, and the fact that the actual resolution of gettimeofday is unknown, may be acceptable for many programs. On modern Linux these APIs are implemented with the vDSO and avoid calling into the kernel (see a discussion here). If a lower cost (around 10ns) and a known resolution are required by the program, clock_gettime with CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE may be a good choice, keeping in mind that the COARSE clocks get their speed from a coarser resolution, which clock_getres reports.
For even higher resolution, rdtsc may be put on the table.
rdtsc and rdtscp
rdtsc is an instruction, supported since the Pentium class of CPUs, that reads the current time stamp counter (TSC), which is incremented every CPU tick (1/CPU_HZ). The TSC is a 64-bit register on x86 processors. PowerPC provides a similar capability. The TSC and rdtsc allow time to be measured accurately.
Everything has two sides, though. You need to pay special attention to the drawbacks if you use rdtsc in your program.
rdtsc instructions may not be executed in the order in which they appear in the executable because of out-of-order execution. This can cause an rdtsc to execute earlier or later than expected and produce a misleading cycle count. Here is an example from Using the RDTSC Instruction for Performance Monitoring:
    rdtsc            ; read time stamp
    mov time, eax    ; move counter into variable
    fdiv             ; floating-point divide
    rdtsc            ; read time stamp
    sub eax, time    ; find the difference
This code tries to measure the time it takes to perform a floating-point division. fdiv takes a long time to complete and, potentially, the second rdtsc instruction could actually execute before the fdiv has finished. If this happens, the cycle count will not be the expected one.
Inserting a serializing instruction such as cpuid, which forces every preceding instruction in the code to complete before allowing the program to continue, keeps the rdtsc instructions from being performed out of order. The code using cpuid for the above example is as follows.
    cpuid            ; force all previous instructions to complete
    rdtsc            ; read time stamp counter
    mov time, eax    ; move counter into variable
    fdiv             ; floating-point divide
    cpuid            ; wait for FDIV to complete before RDTSC
    rdtsc            ; read time stamp counter
    sub eax, time    ; find the difference
An alternative is to use rdtscp, which waits until all previous instructions have been executed before reading the counter. However, rdtscp is not supported on all CPU models. Its availability is indicated by CPUID leaf 80000001H, EDX bit 27: if the bit is set to 1, rdtscp is present on the processor. For more details, check x86-64 ISA / Assembly Programming References.
There are other drawbacks to using rdtsc. Here is a list of concerns combined from Game Timing and Multicore Processors and Time Stamp Counter, which together summarize the possible problems quite well.
Discontinuous values. Multiprocessor and dual-core systems do not guarantee synchronization of their cycle counters between cores. This is exacerbated when combined with modern power management technologies that idle and restore various cores at different times, which results in the cores typically being out of synchronization. For an application, this generally results in glitches or in potential crashes as the thread jumps between the processors and gets timing values that result in large deltas, negative deltas, or halted timing.
Variability of the CPU’s frequency. Technology that changes the frequency of the CPU is in use in many high-end desktop PCs. Recent Intel processors include a constant rate TSC. While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate.
Portability. Reliance on the time stamp counter also reduces portability, as other processors may not have a similar feature.