oom, oom_reaper: do not enqueue same task twice [Linux 4.9.156]

This Linux kernel change "oom, oom_reaper: do not enqueue same task twice" is included in the Linux 4.9.156 release. This change is authored by Tetsuo Handa <penguin-kernel [at] I-love.SAKURA.ne.jp> on Fri Feb 1 14:20:31 2019 -0800. The commit for this change in Linux stable tree is 7f38299 (patch) which is from upstream commit 9bcdeb5. The same Linux upstream change may have been applied to various maintained Linux releases and you can find all Linux releases containing changes from upstream 9bcdeb5.

oom, oom_reaper: do not enqueue same task twice

commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b018325737f76 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.

Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

There are 5 lines of Linux source code added/deleted in this change. Code changes to Linux kernel are as follows.

 include/linux/sched.h | 1 +
 mm/oom_kill.c         | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f4a551a..ebd0afb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -527,6 +527,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_SKIP       21  /* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE       22  /* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE 23      /* mm has ever used the global huge zero page */
+#define MMF_OOM_REAP_QUEUED    26  /* mm was queued for oom_reaper */

 #define MMF_INIT_MASK      (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1de3695..d514eeb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -626,8 +626,8 @@ static void wake_oom_reaper(struct task_struct *tsk)
    if (!oom_reaper_th)
        return;

-   /* tsk is already queued? */
-   if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+   /* mm is already queued? */
+   if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
        return;

    get_task_struct(tsk);

Leave a Reply

Your email address will not be published. Required fields are marked *