How sched_min_granularity_ns, sched_latency_ns and sched_wakeup_granularity_ns in CFS affect the timeslice of processes

Abstract
Currently, the most famous process scheduling algorithm in Linux Kernel is Completely Fair Scheduling (CFS) algorithm. The core idea of CFS is to let each process share the same proportional CPU resources to run so that it is fair to each process. In this article, I will introduce how sched_min_granularity_ns and sched_latency_ns work internal CFS to affect the timeslice of processes. (This article discusses Linux Kernel 3.16.x; other versions may have some differences.)

Details

616 static u64 __sched_period(unsigned long nr_running)
617 {
618         u64 period = sysctl_sched_latency;                   /* get from "/proc/sys/kernel/sched_latency_ns" */
619         unsigned long nr_latency = sched_nr_latency;         /* get from "sched_latecy_ns divides sched_min_granularity" */
620 
621         if (unlikely(nr_running > nr_latency)) {          /* if the real processes on the run queue is bigger than nr_latency, that means sched_latency_ns cannot satisfy all the tasks's minimum granularity time requirements. */
622                 period = sysctl_sched_min_granularity;       /* get from "/proc/sys/kernel/sched_min_granularity_ns" */
623                 period *= nr_running;
624         }
625 
626         return period;
627 }
635 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
636 {
637         u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
638 
639         for_each_sched_entity(se) {
640                 struct load_weight *load;
641                 struct load_weight lw;
642 
643                 cfs_rq = cfs_rq_of(se);
644                 load = &cfs_rq->load;
645 
646                 if (unlikely(!se->on_rq)) {
647                         lw = cfs_rq->load;
648 
649                         update_load_add(&lw, se->load.weight);
650                         load = &lw;
651                 }
652                 slice = __calc_delta(slice, se->load.weight, load);
653         }
654         return slice;
655 }
2889 /*
2890  * Preempt the current task with a newly woken task if needed:
2891  */
2892 static void
2893 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
2894 {
2895         unsigned long ideal_runtime, delta_exec;
2896         struct sched_entity *se;
2897         s64 delta;
2898 
2899         ideal_runtime = sched_slice(cfs_rq, curr);
2900         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
2901         if (delta_exec > ideal_runtime) {
2902                 resched_task(rq_of(cfs_rq)->curr);
2903                 /*
2904                  * The current task ran long enough, ensure it doesn't get
2905                  * re-elected due to buddy favours.
2906                  */
2907                 clear_buddies(cfs_rq, curr);
2908                 return;
2909         }
2910 
2911         /*
2912          * Ensure that a task that missed wakeup preemption by a
2913          * narrow margin doesn't have to wait for a full slice.
2914          * This also mitigates buddy induced latencies under load.
2915          */
2916         if (delta_exec vruntime - se->vruntime;
2921 
2922         if (delta  ideal_runtime)
2926                 resched_task(rq_of(cfs_rq)->curr);
2927 }

In above source codes, check_preempt_tick function calls sched_slice function and sched_slice function calls sched_slice function. Let’s analyse them from sched_slice. In __sched_slice, it gets the time period which can run each task on CPU’s run queue once. In sched_slice, first get the time period which can run each task on the CPU’s run queue once and then, calculate this schedule entity (process)’s ideal time slice to run according to time period, the weight of the schedule entity (process) and the weight of the run queue. For example, assume the time period is 14 ms. If the run queue has two processes with default nice (or priority) value so they have the same weight value (Note that the weight value is calculated by nice value, the lower nice value the bigger weight value). The weight value of the run queue will be 1024 (the weight value of each process for this example) + 1024 (default nice is 0 and its weight value is 1024 correspondingly) so the ideal time slice will be the time period multiply 50% (14 ms * 0.5 = 7 ms).

In check_preempt_tick, it will first get the ideal time slice for current process and then calculate the delta_exec time which is the time that current process has exhausted during this timeslice. Then, current process’s reschedule flag will be set if delta_exec is bigger than ideal_time or sysctl_sched_min_granularity. If not, get the difference, delta, between current process’s vruntime (virtual run time, which is the time the process has been executed on CPU) and the process with minimum vruntime in the runqueue (red black tree); if delta is bigger than ideal_runtime, the current process’s reschedule flag will be set.

Check_preempt_tick is motivated by timer interrupt as follows.

tick_handle_periodic
 -tick_periodic
  --update_process_times
   ---scheduler_tick
    ----task_tick_fair
     -----entity_tick
      ------check_preempt_tick

sched_wakeup_granularity_ns will affect inter-processes preemption. In the following code snippets, you will see only when the difference between the virtual run time of current running process and the virtual run time of preempting process is bigger than the virtual run time of sched_wakeup_granularity_ns (here, transfer sched_wakeup_granularity_ns to a virtual run time with preempting process’s weight).

4604 /*
4605  * Should 'se' preempt 'curr'.
4606  *
4607  *             |s1
4608  *        |s2
4609  *   |s3
4610  *         g
4611  *      ||c
4612  *
4613  *  w(c, s1) = -1
4614  *  w(c, s2) =  0
4615  *  w(c, s3) =  1
4616  *
4617  */
4618 static int
4619 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
4620 {
4621         s64 gran, vdiff = curr->vruntime - se->vruntime;
4622 
4623         if (vdiff  gran)
4628                 return 1;
4629 
4630         return 0;
4631 }
4657 /*
4658  * Preempt the current task with a newly woken task if needed:
4659  */
4660 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
4661 {
...
4713         if (wakeup_preempt_entity(se, pse) == 1) {
4714                 /*
4715                  * Bias pick_next to pick the sched entity that is
4716                  * triggering this preemption.
4717                  */
4718                 if (!next_buddy_marked)
4719                         set_next_buddy(pse);
4720                 goto preempt;
4721         }
4722 
4723         return;
4724 
4725 preempt:
4726         resched_task(curr);
...
4741 }

There is a simple example for sched_wakeup_granularity_ns as follows. When I set sched_wakeup_granularity_ns to be 0, processes will be preempted each other very frequently and it shows that it is better for I/O intensive workload. I just dump the Kernel stack as follows when there is a preemption.

236207 Dec 10 19:15:25 mobius04 kernel: [ 3976.315960] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
236208 Dec 10 19:15:25 mobius04 kernel: [ 3976.315961] check_preempt_wakeup: current time stamp (microseconds) is 1481415325312148
236209 Dec 10 19:15:25 mobius04 kernel: [ 3976.315962] check_preempt_wakeup: thread 5688 will preempt current thread 5719
236210 Dec 10 19:15:25 mobius04 kernel: [ 3976.315962]
236211 Dec 10 19:15:25 mobius04 kernel: [ 3976.315963] CPU: 5 PID: 5684 Comm: qemu-system-x86 Not tainted 3.16.39 #21
236212 Dec 10 19:15:25 mobius04 kernel: [ 3976.315964] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
236213 Dec 10 19:15:25 mobius04 kernel: [ 3976.315965]  0000000000000000 ffff8813bc74bbe8 ffffffff8176b2f0 ffff8813bd0d9400
236214 Dec 10 19:15:25 mobius04 kernel: [ 3976.315966]  ffff8813bc78aa00 ffff8813bc74bc48 ffffffff810b2054 ffff8813bc74bc18
236215 Dec 10 19:15:25 mobius04 kernel: [ 3976.315968]  00000000000ae45d 00000000584c9a9d 000000000004c354 ffff8814015b8a40
236216 Dec 10 19:15:25 mobius04 kernel: [ 3976.315970] Call Trace:
236217 Dec 10 19:15:25 mobius04 kernel: [ 3976.315971]  [] dump_stack+0x64/0x84
236218 Dec 10 19:15:25 mobius04 kernel: [ 3976.315973]  [] check_preempt_wakeup+0x284/0x2b0
236219 Dec 10 19:15:25 mobius04 kernel: [ 3976.315980]  [] check_preempt_curr+0x84/0xa0
236220 Dec 10 19:15:25 mobius04 kernel: [ 3976.315987]  [] ttwu_do_wakeup+0x1d/0xf0
236221 Dec 10 19:15:25 mobius04 kernel: [ 3976.315994]  [] T.2622+0x4a/0x60
236222 Dec 10 19:15:25 mobius04 kernel: [ 3976.316001]  [] try_to_wake_up+0x299/0x370
236223 Dec 10 19:15:25 mobius04 kernel: [ 3976.316008]  [] ? __pollwait+0xf0/0xf0
236224 Dec 10 19:15:25 mobius04 kernel: [ 3976.316015]  [] wake_up_state+0x10/0x20
236225 Dec 10 19:15:25 mobius04 kernel: [ 3976.316022]  [] wake_futex+0x76/0x90
236226 Dec 10 19:15:25 mobius04 kernel: [ 3976.316029]  [] futex_wake+0x12f/0x160
236227 Dec 10 19:15:25 mobius04 kernel: [ 3976.316035]  [] do_futex+0x119/0xd20
236228 Dec 10 19:15:25 mobius04 kernel: [ 3976.316043]  [] ? __dequeue_entity+0x30/0x50
236229 Dec 10 19:15:25 mobius04 kernel: [ 3976.316051]  [] ? eventfd_ctx_read+0x6e/0x2e0
236230 Dec 10 19:15:25 mobius04 kernel: [ 3976.316057]  [] SyS_futex+0x7d/0x170
236231 Dec 10 19:15:25 mobius04 kernel: [ 3976.316064]  [] ? SyS_read+0x9c/0xd0
236232 Dec 10 19:15:25 mobius04 kernel: [ 3976.316067]  [] system_call_fastpath+0x1a/0x1f
236233 Dec 10 19:15:25 mobius04 kernel: [ 3976.316068] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.

Conclusion
1, If Linux Kernel uses CFS, each process will run at least “sched_min_granularity_ns” timeslice.
2, If “sched_latency_ns / sched_min_granularity_ns” is smaller than the real number of processes on CPU’s run queue, the time period (run each process on the run queue once) will be “the real process number on the run queue multiply sched_min_granularity_ns” rather than sched_latency_ns.
3, If the difference between the virtual run time of current running process and the virtual run time of preempting process is bigger than the virtual run time of sched_wakeup_granularity_ns (here, transfer sched_wakeup_granularity_ns to a virtual run time with preempting process’s weight), the preemption happens.

  •  
  •  
  •  
  •  
  •  
  •  
Weiwei Jia is a Ph.D. student in the Department of Computer Science at New Jersey Institute of Technology since 2016. His research interests are include storage systems, operating systems and computer systems.

Leave a Reply

Your email address will not be published. Required fields are marked *