linux-brain/fs/proc
Yafang Shao 765437d1f0 mm, oom: make the calculation of oom badness more accurate
[ Upstream commit 9066e5cfb73cdbcdbb49e87999482ab615e9fc76 ]

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1.  Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value.  As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate.  We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-03 10:08:12 +02:00
..
Kconfig Merge branch 'akpm' (patches from Andrew) 2019-07-17 08:58:04 -07:00
Makefile proc: : uninline name_to_int() 2017-11-17 16:10:00 -08:00
array.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-08 16:39:53 -07:00
base.c mm, oom: make the calculation of oom badness more accurate 2021-09-03 10:08:12 +02:00
cmdline.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
consoles.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 191 2019-05-30 11:29:21 -07:00
cpuinfo.c x86 / CPU: Always show current CPU frequency in /proc/cpuinfo 2017-11-15 19:46:50 +01:00
devices.c proc: introduce proc_create_seq{,_data} 2018-05-16 07:23:35 +02:00
fd.c proc: use "unsigned int" in proc_fill_cache() 2018-06-07 17:34:38 -07:00
fd.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
generic.c proc: fix lookup in /proc/net subdirectories after setns(2) 2021-01-12 20:16:10 +01:00
inode.c proc: Use new_inode not new_inode_pseudo 2020-06-17 16:40:33 +02:00
internal.h proc: fix lookup in /proc/net subdirectories after setns(2) 2021-01-12 20:16:10 +01:00
interrupts.c proc: introduce proc_create_seq{,_data} 2018-05-16 07:23:35 +02:00
kcore.c lockdown: Print current->comm in restriction messages 2019-08-19 21:54:17 -07:00
kmsg.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
loadavg.c sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD 2018-10-26 16:26:32 -07:00
meminfo.c proc/meminfo: fix output alignment 2019-10-19 06:32:32 -04:00
namespaces.c procfs: switch instantiate_t to d_splice_alias() 2018-05-26 14:20:50 -04:00
nommu.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
page.c fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c 2019-10-19 06:32:31 -04:00
proc_net.c proc: fix lookup in /proc/net subdirectories after setns(2) 2021-01-12 20:16:10 +01:00
proc_sysctl.c proc/sysctl: add shared variables for range check 2019-07-18 17:08:07 -07:00
proc_tty.c tty: replace ->proc_fops with ->proc_show 2018-05-16 07:24:30 +02:00
root.c new helper: get_tree_keyed() 2019-09-05 14:34:22 -04:00
self.c proc: don't allow async path resolution of /proc/self components 2020-12-02 08:49:48 +01:00
softirqs.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
stat.c Merge branch 'akpm' (patches from Andrew) 2019-03-06 10:31:36 -08:00
task_mmu.c proc: use untagged_addr() for pagemap_read addresses 2020-12-16 10:56:58 +01:00
task_nommu.c proc: use down_read_killable mmap_sem for /proc/pid/maps 2019-07-12 11:05:46 -07:00
thread_self.c proc: Use new_inode not new_inode_pseudo 2020-06-17 16:40:33 +02:00
uptime.c fs/proc/uptime.c: use ktime_get_boottime_ts64 2018-08-22 10:52:45 -07:00
util.c fs/proc/util.c: include fs/proc/internal.h for name_to_int() 2019-01-04 13:13:45 -08:00
version.c proc: introduce proc_create_single{,_data} 2018-05-16 07:23:35 +02:00
vmcore.c vmalloc: fix remap_vmalloc_range() bounds checks 2020-04-29 16:33:14 +02:00