How does load balancing work inside an operating system? Linux as an example

Introduction

Load balancing redistributes system resources (e.g., CPU, memory) so as to improve system performance, scalability (the system keeps up no matter how many processes contend) and usability (idle resources can be put to use immediately). In this article, I mainly present how load balancing of the CPU resource works inside an operating system, using the Linux kernel (v4.7.4) as an example.

Load Balancing Stack Inside the Kernel

tick_periodic
->update_process_times
->scheduler_tick
->trigger_load_balance
->raise_softirq
SoftIRQ
-> run_rebalance_domains 
-> rebalance_domains 
-> load_balance

As the call stack above suggests, the kernel seems to trigger the load balancing softirq on every tick. However, when looking into the trigger_load_balance function, I find that the actual period is longer.

As shown in the following code snippets, trigger_load_balance checks whether the current time has passed rq->next_balance (i.e., (long)(jiffies - rq->next_balance) >= 0; one jiffy is 4ms if HZ is 250).
To find out how far away rq->next_balance is, I look into rebalance_domains and find that rq->next_balance depends on "interval":
rq->next_balance stays at 60s in the future if (sd->last_balance + interval) - next_balance >= 0; otherwise, rq->next_balance becomes sd->last_balance + interval; here sd->last_balance is the jiffies value at the time the domain was last balanced.
The interval itself is computed by get_sd_balance_interval: 1) if there is an idle CPU, the interval is >= 20ms and < 80ms;
2) if there is no idle CPU, the interval is multiplied by busy_factor (e.g., 32) each time, but is capped at 2s (throttled by max_load_balance_interval).

Once the current time passes rq->next_balance, the softirq handler goes on to run load_balance.

void trigger_load_balance(struct rq *rq)
{
...
    if (time_after_eq(jiffies, rq->next_balance))           ------------> i.e., (long)(jiffies - rq->next_balance) >= 0; the signed compare handles jiffies wraparound
        raise_softirq(SCHED_SOFTIRQ);
...
}
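
To make the check above concrete, here is a tiny user-space model (my own toy, not kernel code) of the throttling in trigger_load_balance: the tick fires every jiffy, but the softirq is only raised once jiffies catches up with next_balance; the signed subtraction is the same trick the kernel's time_after_eq macro uses to survive jiffies wraparound. The 5-jiffy interval is an arbitrary example value.

/* Toy user-space model of the throttling in trigger_load_balance().
 * Not kernel code; the 5-jiffy interval is an arbitrary example. */
#include <stdio.h>

static unsigned long jiffies;       /* tick counter, like the kernel's jiffies */
static unsigned long next_balance;  /* plays the role of rq->next_balance */

/* same idea as the kernel's time_after_eq(): signed compare handles wraparound */
static int after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

static void trigger_load_balance(void)
{
    if (after_eq(jiffies, next_balance)) {
        printf("jiffy %lu: raise SCHED_SOFTIRQ\n", jiffies);
        next_balance = jiffies + 5;   /* pretend rebalance_domains pushed it out */
    }
}

int main(void)
{
    for (jiffies = 0; jiffies < 20; jiffies++)
        trigger_load_balance();       /* called from the periodic tick */
    return 0;
}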

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
...
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;                     ------------------> next_balance is initialized to 60s in the future; one jiffy is 4ms if HZ is 250.
...
interval = get_sd_balance_interval(sd, idle != CPU_IDLE);         ------------------> get interval
...
if (time_after(next_balance, sd->last_balance + interval)) {      ------------------> i.e., (long)((sd->last_balance + interval) - next_balance) < 0
    next_balance = sd->last_balance + interval;
}
...
if (likely(update_next_balance)) {
    rq->next_balance = next_balance;
}
...
}
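
Plugging hypothetical numbers into the snippet above shows how the 60s default is usually shortened to some domain's sd->last_balance + interval. The values below (HZ = 250 and a 160-jiffy, i.e., 640ms, interval) are only an illustration:

/* Worked example of the min() effect in rebalance_domains(); the numbers
 * (HZ = 250, interval = 160 jiffies = 640 ms) are hypothetical. */
#include <stdio.h>

static int after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;        /* like the kernel's time_after() */
}

int main(void)
{
    unsigned long hz = 250;
    unsigned long jiffies = 100000;                  /* "now" */
    unsigned long next_balance = jiffies + 60 * hz;  /* the 60 s default */
    unsigned long last_balance = jiffies;            /* domain just balanced */
    unsigned long interval = 160;                    /* 640 ms in jiffies */

    if (after(next_balance, last_balance + interval))
        next_balance = last_balance + interval;

    printf("next balance due in %lu jiffies (%lu ms)\n",
           next_balance - jiffies, (next_balance - jiffies) * 1000 / hz);
    return 0;
}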

static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
    unsigned long interval = sd->balance_interval;          -----------------> read from the system configuration (i.e., under "/proc/sys/kernel/sched_domain/cpuXX/domainX")

    if (cpu_busy)
        interval *= sd->busy_factor;                    -----------------> also from the system configuration (same path as above); busy_factor is 32 on my system.

    /* scale ms to jiffies */
    interval = msecs_to_jiffies(interval);
    interval = clamp(interval, 1UL, max_load_balance_interval);         ---------------> min(max(interval, 1), max_load_balance_interval); max_load_balance_interval is initialized to HZ/10 (e.g., 250/10) and updated to HZ*num_online_cpus()/10 jiffies (e.g., 250*80/10 = 2000 jiffies).

    return interval;
}
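
The arithmetic in get_sd_balance_interval is easy to check by hand. The stand-alone program below uses the example tunables quoted above (busy_factor = 32, a 2s cap, HZ = 250) plus a few hypothetical per-domain balance_interval values to show how a busy CPU's interval lands between 640ms and the cap:

/* Stand-alone check of the busy-CPU interval math in get_sd_balance_interval().
 * The tunables (busy_factor 32, 2 s cap, HZ = 250, i.e. 4 ms per jiffy) are the
 * example values from this article, not universal constants. */
#include <stdio.h>

#define MS_PER_JIFFY 4UL

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
    unsigned long busy_factor = 32;
    unsigned long max_lb_interval = 2000 / MS_PER_JIFFY;   /* 2 s cap, in jiffies */
    unsigned long balance_interval_ms[] = { 20, 40, 100 }; /* hypothetical domains */

    for (int i = 0; i < 3; i++) {
        unsigned long ms = balance_interval_ms[i] * busy_factor;
        unsigned long jiffies_iv = clamp_ul(ms / MS_PER_JIFFY, 1, max_lb_interval);
        printf("balance_interval %3lu ms -> busy interval %4lu ms\n",
               balance_interval_ms[i], jiffies_iv * MS_PER_JIFFY);
    }
    return 0;
}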

static int load_balance(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)
{
...
    if (likely(!active_balance) || voluntary_active_balance(&env)) {
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;                            ---------------> min_interval (e.g., 20ms) is read from the system configuration under "/proc/sys/kernel/sched_domain/cpuXX/domainX"
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * detach_tasks).
         */
        if (sd->balance_interval < sd->max_interval)                       ---------------> max_interval (e.g., 40ms) is read from the system configuration under "/proc/sys/kernel/sched_domain/cpuXX/domainX"
            sd->balance_interval *= 2;
    }
...
}
load_balance (linux/kernel/sched/fair.c)
       --> find_busiest_group
       ---> find_busiest_queue
       ----> nr_running > 1 -> detach_tasks && attach_tasks
       -----> if moving tasks here fails, active_balance will be set to wake up the migration thread on the source CPU
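
At a high level this flow is a "pull" model: the CPU that runs load_balance looks for the busiest runqueue and pulls tasks toward itself. Below is a deliberately oversimplified toy of that idea (it ignores load weights, scheduling groups, CPU capacity and pinned tasks, all of which the real find_busiest_group/detach_tasks logic takes into account):

/* Conceptual toy (not kernel code) of the "pull" pattern load_balance follows:
 * the destination CPU finds the busiest runqueue and pulls tasks until the
 * imbalance is gone. Task counts are hypothetical. */
#include <stdio.h>

#define NR_CPUS 4

int main(void)
{
    int nr_running[NR_CPUS] = { 6, 1, 1, 0 };    /* hypothetical runnable counts */
    int this_cpu = 3;                            /* the idle CPU doing the pull */

    /* find_busiest_queue: pick the runqueue with the most runnable tasks */
    int busiest = 0;
    for (int cpu = 1; cpu < NR_CPUS; cpu++)
        if (nr_running[cpu] > nr_running[busiest])
            busiest = cpu;

    /* detach_tasks/attach_tasks: move tasks while an imbalance remains */
    while (nr_running[busiest] - nr_running[this_cpu] > 1) {
        nr_running[busiest]--;                   /* detach from the source rq */
        nr_running[this_cpu]++;                  /* attach to this rq */
    }

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        printf("cpu%d: %d tasks\n", cpu, nr_running[cpu]);
    return 0;
}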

The migration thread has two key tasks:
1. Fulfilling migration requests originating from the scheduler (e.g., from sched_setaffinity; see the user-space example after the two call stacks below). Call stack:

sched_setaffinity
->__set_cpus_allowed_ptr
->stop_one_cpu
->migration_cpu_stop
->__migrate_task
->move_queued_task
->enqueue_task
->returns the new run queue of the destination CPU

2. Implementing active load balancing. Call stack:

load_balance
->stop_one_cpu_nowait (pack into a work and ask migration thread to do)
->active_load_balance_cpu_stop
->detach_one_task
->attach_one_task
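
From user space, the easiest way to exercise the first path above is sched_setaffinity(2): if the calling task is currently running on a CPU that its new mask no longer allows, the kernel goes through migration_cpu_stop to move it. A minimal example that pins the current process to CPU 0:

/* A user-space call that exercises the sched_setaffinity path above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* allow CPU 0 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}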

In addition, I pick out the following comments from the kernel source to describe how the migration thread works. Note that the "stopper" in the kernel is the per-CPU migration thread. You can also look into my previous articles about how the migration thread and sched_setaffinity work inside the kernel [5][6].

/*
 * This is how migration works:
 *
 * 1) we invoke migration_cpu_stop() on the target CPU using
 *    stop_one_cpu().
 * 2) stopper starts to run (implicitly forcing the migrated thread
 *    off the CPU)
 * 3) it checks whether the migrated task is still in the wrong runqueue.
 * 4) if it's in the wrong runqueue then the migration thread removes
 *    it and puts it into the right queue.
 * 5) stopper completes and stop_one_cpu() returns and the migration
 *    is done.
 */
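
To illustrate the pattern described in the comment, here is a rough user-space analogue (my own toy, not kernel code) of stop_one_cpu: the caller queues one piece of work for a dedicated "stopper" thread and blocks until it has run. The real stopper additionally runs at the highest priority on the target CPU, which is what forces the to-be-migrated task off that CPU.

/* Toy user-space analogue of stop_one_cpu(): queue one work item for a
 * dedicated "stopper" thread and wait for its completion. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int (*work_fn)(void *);
static void *work_arg;
static int have_work, work_done;

static void *stopper(void *unused)
{
    pthread_mutex_lock(&lock);
    while (!have_work)
        pthread_cond_wait(&cond, &lock);
    work_fn(work_arg);                 /* run the queued function */
    work_done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static int migrate(void *arg)
{
    printf("stopper: migrating task %s\n", (char *)arg);
    return 0;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, stopper, NULL);

    /* "stop_one_cpu": queue the work and wait until the stopper has run it */
    pthread_mutex_lock(&lock);
    work_fn = migrate;
    work_arg = "demo";
    have_work = 1;
    pthread_cond_broadcast(&cond);
    while (!work_done)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}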

Conclusion and Future Work

This article mainly discusses how load balancing works inside the Linux kernel. There are still two open questions: 1) who (i.e., which CPU) actually does the load balancing work? 2) what are the mechanisms/rules for load balancing in the Linux kernel; more precisely, in what state does the kernel consider the system already balanced or not balanced? These two questions will be addressed in future articles.

Supplement: for the periodic load balancing entry, if there are idle CPUs, the period is >= 20ms and < 80ms; if there are no idle CPUs, the period is >= 640ms and at most 2s (or up to 60s if no domain needs rebalancing sooner).

References

[1] Linux Kernel Development, Robert Love
[2] Understanding the Linux Kernel, Daniel P. Bovet, Marco Cesati
[3] Professional Linux Kernel Architecture, Wolfgang Mauerer
[4] https://elixir.free-electrons.com/linux/v4.7.4/source
[5] https://www.systutorials.com/239971/migration-thread-works-inside-linux-kernel/
[6] https://www.systutorials.com/239953/sched_setaffinity-works-inside-of-linux-kernel/
