How sched_min_granularity_ns and sched_latency_ns in CFS affect the timeslice of processes

Currently, the default process scheduler in the Linux kernel is the Completely Fair Scheduler (CFS). The core idea of CFS is to give each process a proportionally fair share of CPU time, so that scheduling is fair to every process. In this article, I will explain how sched_min_granularity_ns and sched_latency_ns work internally in CFS to determine the timeslices of processes. (This article discusses Linux kernel 3.16.x; other versions may differ.)


static u64 __sched_period(unsigned long nr_running)
{
        u64 period = sysctl_sched_latency;            /* from "/proc/sys/kernel/sched_latency_ns" */
        unsigned long nr_latency = sched_nr_latency;  /* sched_latency_ns / sched_min_granularity_ns */

        /*
         * If there are more runnable processes than nr_latency,
         * sched_latency_ns cannot satisfy every task's minimum
         * granularity requirement, so stretch the period instead.
         */
        if (unlikely(nr_running > nr_latency)) {
                period = sysctl_sched_min_granularity;  /* from "/proc/sys/kernel/sched_min_granularity_ns" */
                period *= nr_running;
        }

        return period;
}
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

        for_each_sched_entity(se) {
                struct load_weight *load;
                struct load_weight lw;

                cfs_rq = cfs_rq_of(se);
                load = &cfs_rq->load;

                if (unlikely(!se->on_rq)) {
                        lw = cfs_rq->load;

                        update_load_add(&lw, se->load.weight);
                        load = &lw;
                }
                slice = __calc_delta(slice, se->load.weight, load);
        }
        return slice;
}
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_task(rq_of(cfs_rq)->curr);
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

        /*
         * Ensure that a task that missed wakeup preemption by a
         * narrow margin doesn't have to wait for a full slice.
         * This also mitigates buddy induced latencies under load.
         */
        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0)
                return;

        if (delta > ideal_runtime)
                resched_task(rq_of(cfs_rq)->curr);
}

In the source code above, check_preempt_tick calls sched_slice, and sched_slice calls __sched_period. Let's analyse them starting from __sched_period. __sched_period returns the period within which every task on the CPU's run queue can run once. sched_slice first obtains that period and then calculates this scheduling entity's (process's) ideal timeslice from the period, the weight of the scheduling entity, and the total weight of the run queue. For example, assume the period is 14 ms and the run queue holds two processes with the default nice value 0, so both have the same weight, 1024 (the weight is derived from the nice value: the lower the nice value, the larger the weight). The total weight of the run queue is then 1024 + 1024 = 2048, so each process's ideal timeslice is the period multiplied by 1024/2048, that is, 14 ms * 0.5 = 7 ms.

In check_preempt_tick, the kernel first obtains the ideal timeslice for the current process and then computes delta_exec, the time the current process has run during this slice. If delta_exec is bigger than ideal_runtime, the current process's reschedule flag is set. Otherwise, if delta_exec is smaller than sysctl_sched_min_granularity, the function returns without preempting, which guarantees each process its minimum granularity. If neither holds, the kernel computes delta, the difference between the current process's vruntime (virtual runtime, the weighted time the process has executed on the CPU) and the minimum vruntime in the run queue (the leftmost node of the red-black tree); if delta is bigger than ideal_runtime, the current process's reschedule flag is set as well.

check_preempt_tick is invoked from the periodic scheduler tick (timer interrupt). From the analysis above, we can draw two conclusions.


1. If the Linux kernel uses CFS, each process runs for at least the sched_min_granularity_ns timeslice before it can be preempted at a tick.
2. If sched_latency_ns / sched_min_granularity_ns is smaller than the number of runnable processes on the CPU's run queue, the period (in which every process on the run queue runs once) becomes the number of runnable processes multiplied by sched_min_granularity_ns, rather than sched_latency_ns.

Weiwei Jia

Weiwei Jia has been a Ph.D. student in the Department of Computer Science at the New Jersey Institute of Technology since 2016. His research interests include storage systems, operating systems, and computer systems.
