Commit Graph

14265 Commits

Author SHA1 Message Date
Andrey Zhizhikin 8b231e0e8e This is the 5.4.148 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmFLBPMACgkQONu9yGCS
 aT6BIQ//Wb4ZQJtEVvaKnda7vFwe8BoZzPGYZA4Imn9KERDRgHuavEuRfMQtKc2y
 YHwe/PD2JreuDHcd+Wz32xsdMe045xNvgiE1oGcxq0jNBvhJqANSmVTWpdqAquON
 cTmwsK3roa7ELC2g1WjrYZDv6CrCggqvbuM9AJ/cLITtd8zerhLdZo+CCDG/28cH
 EosrWvkBcaGmX+r/IBC86Rt6K2OFQ/3LLbb79L4vjKi5lopsm5CTAmfOfIk8p1gB
 mGB3PkQZnIqphBfqGXLGuljl4e+zb1SONrugUh78Egom393Ex34oo+RjWEGe9dV2
 Stkuqo0GTi85X7JA7SGCA/xgF8A8yvaaLjQBsJsL9+2ji+GW+J7hfn4mE5h8H3Di
 UBjeLMFJA8Mge8Ng9xUSttvjRdwSTm0jWTS9SOl07w24b0pKYbMrQdWt2eI6CT+/
 ytq3nCxNJZKeVcAVH+OJNrbSLYvMy/PgYvGTbzASkNmpAeyNiHOyBz1sRcoiAM9U
 QCWDdZyaqDKktqEyKHxK3opqPzbnHfZFFlCxR7Gw7vvR+itIGJEh/50RNv2F6vnu
 wzowrVxe+Bf1h7JiNEqLLVHdiuygRqjH1ygepGM4+3TVF4jYHzDISyrqlA/Se3Pg
 Hhvlzsbv7PH+KiApwBFjSeHTs5WOrokGMFQ7ZYFDpPkleWiywS0=
 =50Hk
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.148' into 5.4-2.3.x-imx

This is the 5.4.148 stable release

Conflicts:
- drivers/dma/imx-sdma.c:
The following upstream patches are already applied to the NXP tree:
7cfbf391e8 ("dmaengine: imx-sdma: remove duplicated sdma_load_context")
788122c99d ("Revert "dmaengine: imx-sdma: refine to load context only
once"")

- drivers/usb/chipidea/host.c:
Merge upstream commit a18cfd715e ("usb: chipidea: host: fix port index
underflow and UBSAN complains") into the NXP version.

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-22 13:19:32 +00:00
David Hildenbrand 8656566821 mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream.

Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.

These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.

[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com

This patch (of 4):

Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.

As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.

Use "unsigned long" instead, just as we do in other places when handling
PFNs.  This can bite us once we have physical addresses in the range of
multiple TB.
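
For illustration only (not part of the changelog; parameter names are
approximate), the declaration changed by this patch is roughly:

	struct zone *zone_for_pfn_range(int online_type, int nid,
					unsigned long start_pfn,  /* was: unsigned */
					unsigned long nr_pages);

	/* Example of the truncation being fixed: with 4K pages, a page at
	 * physical address 16TB + 4MB has pfn 0x100000400.  Truncated to
	 * "unsigned int" that becomes 0x400, i.e. physical address 4MB, so
	 * the zone lookup could land in a special zone such as ZONE_DMA.
	 */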

Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e6893026 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:43 +02:00
Rik van Riel d0ddb80bbf mm,vmscan: fix divide by zero in get_scan_count
commit 32d4f4b782bb8f0ceb78c6b5dc46eb577ae25bf7 upstream.

Commit f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory to
be reclaimed after killing the tasks, by writing to the memory.max file
for that cgroup, forcing the remaining page cache and reclaimable slab
to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection, that would result
in the memory being reclaimed down to the memory.low limit, or likely
not at all, with the remaining page cache reclaimed asynchronously later.

With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.
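
For readers following along, a rough sketch of where the division sits in
get_scan_count() (names approximate, not the literal diff):

	unsigned long cgroup_size = mem_cgroup_size(memcg);

	cgroup_size = max(cgroup_size, protection);

	/* When oomd forces memory.max down to zero and races with another
	 * reclaimer, both protection and cgroup_size can be 0 here, so this
	 * division trapped; the fix simply keeps the denominator non-zero.
	 */
	scan = lruvec_size - lruvec_size * protection / cgroup_size;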

Link: https://lkml.kernel.org/r/20210826220149.058089c6@imladris.surriel.com
Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:37 +02:00
Andrey Zhizhikin e4dba8435a This is the 5.4.145 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmE9pNwACgkQONu9yGCS
 aT5g2Q//ZEiIQBvw6uZEA3z1Y6tuKFyIxxwOu24EbCTvB1oXXQX/XXQQPWori1Ny
 OpjuQwXJr2LW+/wEKvUEj8mTrpFD+LsZmXRLBHCw9EqqD5RURDqUZt+xh+4xtV/S
 2EgF0nCO9+84wo628Lc1C2LBJZZEo/kD7LnGeln+BXwRS1FQvGfD+5KIpOR2YzqI
 hrCtVfO5ZQpv59PrAQkfwnfITk9BM1cwA7LCD75WEN59405ZV3mzFyTdvz8s0iGt
 akxmwajLNGjQ/ro567tjpsWiK7EF26mNRTMZqu1jK6h/KjU9sQ4DzCqB+p5TPh/9
 mj/Rzq1lSjLodsR0OznKBqFIVaqXyTU+0cMItjos9MBsG/4GOj8ixbXdFRG99WmK
 bNsYucotSrE9ApYwsmqYaNcHcGeLUIsYCDFCQp3++oeF59+FA7Pp7B4bI/zcYRwY
 aqbfTkMzo8/e4OF0B2LCx+8r0xol1SoLwBfcP3hb7rlKp9OkSYsKrJ/29CUuINe1
 YC5HdrPf2HP36jlVCll5rQa+ERaxtNSCozgwxHG/x2yeOmiVqxdE+vUUmyRidah8
 DvYklCM7upUDi1ujbOwbor9R1jQSXkWMFK76EBB3GJPgguFNyczFXm8xBzfRLQvw
 H6YjIfnxNt+DLPn5uXIEhU7ISTkUno9i1BEd2NoeT1UiYTlk2bw=
 =/lic
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmE+VUgACgkQ7G51OISz
 Hs0D+A//c+QInKUmZ5jcHXXqaiD0JYt5+arGOIBAgIjRcWDL3yJpuuP5nwZRnuRf
 N0SyJjZIpb/WyS27zEaHwOg7xKkhCqagix/J9qdNDP94W0xoGKdXsxotnixXXjMo
 tY9RUqwA45i412x6RA4sXHBmyhgPwSuoLnHk+qVF4xOBD2rshVXzv6zQAO51Lg4J
 3/a2agXlrreCGtHmnKwn47pW+mkHEaquCmcXaz/6PNbWDHwZpmmtCRlsvMW1PyMe
 7TZ/mQ3fofSX3bxuzzYWmddcGTNtwe/IsmEFZmhC3Vl4BGtqE2YKj3439zca1572
 6ttLDy9K3pAg/rf0NhIe7nIlogmJyunP6U8CVnuAwH0esixWxyXSSXUKTOXhKYFG
 jgv/+adhqjSZQhm1p1HY8KItsJE8o5m87b2UKMe1UKGExAJ66qJsgZtuufbEDA5r
 S+qnQEzE4lfaYmCH6WBtn6a1fodONw/KmsJwk5UhNlv2BVS8LQpcpADVRmq/mCDj
 FpCe7uFJVaDnxpUc1Trv6D6YJpL6n1KIUQcVyExtmBtpuKK/KTLs+NP5hpUZ9xtV
 ftAEiZkplJP+qACdCeZt5QukiHBIfds32sUZ4PDp3ANbMmY6UpTWGQyZq+7PNL4d
 +OzCCDzlXRdZioua6z8oTaAkabo2pcK5ao7GXSor35mU6JZvMVg=
 =Z0Kz
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.145' into 5.4-2.3.x-imx

This is the 5.4.145 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-12 19:30:13 +00:00
Muchun Song cf1222b877 mm/page_alloc: speed up the iteration of max_order
commit 7ad69832f37e3cea8557db6df7c793905f1135e8 upstream.

When we free a page whose order is very close to MAX_ORDER and greater
than pageblock_order, it wastes some CPU cycles to increase max_order to
MAX_ORDER one by one and to check the pageblock migratetype of that page
repeatedly, especially when MAX_ORDER is much larger than pageblock_order.

We also should not be checking migratetype of buddy when "order ==
MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition.
With the new check, we don't need the max_order check anymore, so we
replace it.

Also adjust the max_order initialization so that it's lower by one than
previously, which hopefully makes the code clearer.
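
As a rough illustration of the described shape (simplified sketch, not
the literal patch):

	/* Cap the first merging pass at the pageblock boundary instead of
	 * creeping max_order up to MAX_ORDER one step at a time and
	 * re-checking the pageblock migratetype on every step.
	 */
	unsigned int max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);

continue_merging:
	while (order < max_order) {
		/* ... find buddy, merge, order++ ... */
	}
	if (order < MAX_ORDER - 1) {
		/* Crossing pageblock_order: check isolation once, then
		 * allow merging up to MAX_ORDER - 1.
		 */
		max_order = order + 1;
		goto continue_merging;
	}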

Link: https://lkml.kernel.org/r/20201204155109.55451-1-songmuchun@bytedance.com
Fixes: d9dddbf556 ("mm/page_alloc: prevent merging between isolated and other pageblocks")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-12 08:56:41 +02:00
Andrey Zhizhikin 79c30f58eb This is the 5.4.144 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEx2AAACgkQONu9yGCS
 aT7csg//ZhXXfRkPNMhpkkMjcV7F825mLAPs1vsluIEIZ0oInOpegu8SyDENOfui
 HyFLZ/2Stewa0mn7kNS1caAUXLpFvZ087sIz/SipzupFjLTUHFsNcMYrd19R1M4h
 UK/owAJeoq/pgR4kUck4o/r+47lo8CMqkscbEdKSvwxYUeANIcbGVB5Sf2UaJr5S
 lqBZeliWY/jYGvLWBoSc7mvUwWRbkKLnQu2JkfvGKM4ODOzpbh8TUhq8NxEL7ZFn
 mZxtNmWPvG2PHHvNP89pwKnKQx70ySKrlQdDv10gL6nIHhKuqwLxBo28Q+KcKMYr
 vfoOFS5Vk35jA7Xt8LhNF+lQtDTbN+2YLeDtoAq+aWMmEW/RUYXSU/3thh+WFuO5
 uZZAbrh4r3bew+PLFpEtnVjxkpMsU9EC33KuIZXIGlDEkFlEneJ9pMQYH7XIwQnV
 5sSSOnbyzkajxv9Kpu6XEg3kKyJf+gk/AB/psgfMR0v/jQ4PXVk9+cZDZxKFcxjj
 wGywDkgIb+/sPrABWici/yXjIup0OSG1fK9/Ki9uLgNzxXZ0h+4e3DcXNMxs1B/p
 GpBPP773qIff2lEDhAI+SbP8pHj5Mnc1j77WUQTU9vsIJcftYm4i0G+POpXnynzx
 gzbjJjOhTBL57OciLQlmL2s5ZZUPgPvu5VoHsRfwOu/bbarRADE=
 =RA6W
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEx8s8ACgkQ7G51OISz
 Hs1JZA/8Cj7g56QymgMuXHEB1PecU7pLpO5egRK3X6xHxJwksD7Xp2LfpaRxjzGw
 XNQsp+4mbJX4oHiZPjD/RsFOdVuNU3ff3mliSmoH2Tdepa2TuKFt7T8V3GE7FN6K
 ns52rvIzbhF762nL1Vs+LE0YBq1w6rTvL7eenNxMo9pwUxJv95X91v7BpRQjTAY5
 /ngvj8tRKN10dSokwrCpzk47Sj/jhSoLlckJL7+iOopQdhOo/HTfWj1aPCaZC/AX
 q2EUg/L2GB1Ij342lDNEZSWn2xAvuAT6+45R8p3GxyG6TMihwiKGXQM922MJDZAV
 T3Chxgu//OlB/spPMAuFgfBNqaX1z+zxv3Dc1EvEbSNPhn6PwEZ2ck9hYkuPmvI3
 78dkyqj3x3AR5VKvc/CpnqSokXBjV7B1TOxJlHKvJ77lvWuDwujir+chmULjahA8
 bVPpbBC9BfF/nX0cYsjQuDNyddqTpt3cv1Cp9w5gXhs/Nj5MsNDRyZxaVHlGaI/W
 h3N6rAU2cNDDtI4Zqr8Lo5IgBLMVUPuj9ZUNUJBKq3YX5CooEmjCtBKZch55Ou5h
 6xmcaMgrFre3FHKfvVhJ5ACK/DoPuWLvr4Af4Q0v6kgif81Is4LQQ+EEXRLMryi2
 fTv2X5r2GLoMHlUH0WgBW8pY0NuoCJiZuxCY5T1c61pCbDCpRss=
 =yagG
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.144' into 5.4-2.3.x-imx

This is the 5.4.144 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-03 10:02:52 +00:00
Yafang Shao 765437d1f0 mm, oom: make the calculation of oom badness more accurate
[ Upstream commit 9066e5cfb73cdbcdbb49e87999482ab615e9fc76 ]

Recently we found an issue in our production environment: when memcg
oom is triggered, the oom killer doesn't choose the process with the
largest resident memory but chooses the first scanned process.  Note
that all processes in this memcg have the same oom_score_adj, so the
oom killer should choose the process with the largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process, 5740 (pause), was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore negative points and convert
all of them to 1.  Now, as the oom_score_adj of all the processes in this
targeted memcg has the same value of -998, the points of these processes
are all negative.  As a result, the first scanned process will be killed.

The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
Guaranteed pod, which has a higher priority that prevents it from being
killed by the system oom.

To fix this issue, we should make the calculation of the oom points more
accurate.  We can achieve that by converting chosen_points from 'unsigned
long' to 'long'.
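
For illustration only (a simplified sketch, not the exact patch; the
names follow oom_badness() but details are approximate): keeping the
badness signed preserves the ordering even when oom_score_adj pushes all
candidates below zero.

	long points;

	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 mm_pgtables_bytes(p->mm) / PAGE_SIZE;

	/* With 'unsigned long' points, a large negative adjustment (e.g.
	 * oom_score_adj = -998 on a Guaranteed pod) used to be clamped to 1
	 * for every task, erasing the size ordering; as 'long' the values
	 * remain comparable.
	 */
	points += (long)p->signal->oom_score_adj * totalpages / 1000;

	return points;	/* callers compare signed values now */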

[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-03 10:08:12 +02:00
Andrey Zhizhikin 49ef8ef4cd Linux 5.4.143
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmEnj2AACgkQ3qZv95d3
 LNxNEQ//auFOSmgsMtI8LDmKlP/f22+FmICk8+IHeBMRBMDY0WGEEdsRZgcf4R7M
 hgyBn8ISmU5W0idpoxzVTiNxDJ0YVbVSIX12lZO6OHnwcv6hNW7iOW5TaGjd8EN+
 fkh8MtAToBQrp4fFb1QkC11pYNMPiuvDNB2nW+F3ixfYLyC1EF4g2/qVUKy7s6rZ
 dbqDfuI3Q7R2opsIkpmsV7ClKGbJzsP7oo0H5EOQMpmOowhg3oJy8oYqMMTgij1T
 bJU8kujNElsK+/nbpVzJPrpprQH9eGP+hB5ZAv6s/FuJ6RmkoAczYQnX3HL6TfCS
 ymoyJ01gsmDic9RnG6qei5LkCwf5Td2SKjRZdqGWKTluWD1ZAwzUX8Ww6K+t5uWk
 PQPyCfU2wk2D3JjJWt0vTxl/GZGAkYbZpy5ISZFJhK7/j9/oTSrPWra7/BRu4K2I
 2PK7XGjNyQxSguQqmG064Q0nYEOU03pR2H8tyG3iH0nBBd9p54D0Bg0D73I2h0az
 PoGhBo71m9SYCPP1zSXl+xLFyWGZZDUYCaU9KPlwkYCCcRUSQbfCKwrYHfEcHZgL
 4QtYlpUi+/C0Ga7gAK9ierqCKSNTOpoVna618j97uqCYVIU8estLBqX4mMAQquVF
 R8+cy6L/aTBVw4Zwd0Jmt85GwBHlHahUGEq87+Qpw/laqjkBFcg=
 =SIVq
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEorqsACgkQ7G51OISz
 Hs3MNQ/9GOK/4DrXYH4RTeGJG2+tTk8VMOrmoDxQOkGXggLvpw06gP1Q9Dz8qzee
 hkILNVr4xC1qxzBzCIAOwzSM0DOQeGJrjtlDPOHxFB6akmwfZ9mU7W4k3YaBHz2c
 +pp4YM0YalmVoJSDBVyFrN67q8gorK39yPgi3BC2tUAz9OSJfPsmdKwqe8ICe7Lp
 hq7R8VfeZdcmYW2vRF5v2yOlzg9vlEd2JfXVL+LJ3R9Eo2Ytlam3gaeObJJqtjDw
 MObvvSLeG2QjZ38tvrWjudfR1Z0hDFy/E1E1AI4y9STLHHQj0eM+3dzEP1mugVVk
 bvLeas2Raf8IA0tJfNQIgz54DcrCT7vHGblKkqESg6kJ/fVc+7CxaLf72DdU0KrN
 0IzRHu3TX8LCj1BQax6eDWlysa3k837WkRMLK55NKPhqkDCb3IhpI3YB+rVgQUoa
 DpjmZ+sTDPypasZf2rGTdGosZBccsRdNbohmOuR202rKOeMnndGJRQrAwdYym/WO
 C9pKosNwBCFk4I97chihvamN/vH0MEJkICwR858I6+uCAMPISNqYcSbRYC3YXoyy
 VU0JBrp4hM5IlKvxKoRuT/Pplt54xhoWREXOS1ua+BOhcWox85QYt+d7uLxt02iZ
 4u9IEXwy0QM+qxZaWw+9+IP4DfOgVa3BAsrVzRfAoHrbwMte7xY=
 =W2Mz
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.143' into 5.4-2.3.x-imx

Linux 5.4.143

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-27 09:21:44 +00:00
Johannes Weiner 41c7f46c89 mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim
[ Upstream commit f56ce412a59d7d938b81de8878faef128812482c ]

We've noticed occasional OOM killing when memory.low settings are in
effect for cgroups.  This is unexpected and undesirable as memory.low is
supposed to express non-OOMing memory priorities between cgroups.

The reason for this is proportional memory.low reclaim.  When cgroups
are below their memory.low threshold, reclaim passes them over in the
first round, and then retries if it couldn't find pages anywhere else.
But when cgroups are slightly above their memory.low setting, page scan
force is scaled down and diminished in proportion to the overage, to the
point where it can cause reclaim to fail as well - only in that case we
currently don't retry, and instead trigger OOM.

To fix this, hook proportional reclaim into the same retry logic we have
in place for when cgroups are skipped entirely.  This way if reclaim
fails and some cgroups were scanned with diminished pressure, we'll try
another full-force cycle before giving up and OOMing.
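
Roughly, the retry logic being hooked into looks like this in
do_try_to_free_pages() (sketch; field names from struct scan_control):

	/* Reclaim failed, but some cgroups were skipped or scanned with
	 * reduced pressure because of memory.low: retry one full-force
	 * cycle before giving up and letting the OOM killer run.
	 */
	if (sc->memcg_low_skipped) {
		sc->priority = initial_priority;
		sc->memcg_low_reclaim = 1;
		sc->memcg_low_skipped = 0;
		goto retry;
	}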

[akpm@linux-foundation.org: coding-style fixes]

Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org
Fixes: 9783aa9917 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Leon Yang <lnyng@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>		[5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26 08:36:22 -04:00
Yafang Shao 1a3aa81444 mm, memcg: avoid stale protection values when cgroup is above protection
[ Upstream commit 22f7496f0b901249f23c5251eb8a10aae126b909 ]

Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

This series contains a fix for an edge case in my earlier protection
calculation patches, and a patch to make the area overall a little more
robust to hopefully help avoid this in future.

This patch (of 2):

A cgroup can have both memory protection and a memory limit to isolate it
from its siblings in both directions - for example, to prevent it from
being shrunk below 2G under high pressure from outside, but also from
growing beyond 4G under low pressure.

Commit 9783aa9917 ("mm, memcg: proportional memory.{low,min} reclaim")
implemented proportional scan pressure so that multiple siblings in excess
of their protection settings don't get reclaimed equally but instead in
accordance to their unprotected portion.

During limit reclaim, this proportionality shouldn't apply of course:
there is no competition, all pressure is from within the cgroup and should
be applied as such.  Reclaim should operate at full efficiency.

However, mem_cgroup_protected() never expected anybody to look at the
effective protection values when it indicated that the cgroup is above its
protection.  As a result, a query during limit reclaim may return stale
protection values that were calculated by a previous reclaim cycle in
which the cgroup did have siblings.

When this happens, reclaim is unnecessarily hesitant and potentially slow
to meet the desired limit.  In theory this could lead to premature OOM
kills, although it's not obvious this has occurred in practice.

Work around the problem by special-casing reclaim roots in
mem_cgroup_protection().  These memcgs never participate in the
reclaim protection because the reclaim is internal.

We have to ignore effective protection values for reclaim roots because
mem_cgroup_protected might be called from racing reclaim contexts with
different roots.  Calculation is relying on root -> leaf tree traversal
therefore top-down reclaim protection invariants should hold.  The only
exception is the reclaim root which should have effective protection set
to 0 but that would be problematic for the following setup:

 Let's have global and A's reclaim in parallel:
  |
  A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
  |\
  | C (low = 1G, usage = 2.5G)
  B (low = 1G, usage = 0.5G)

 for A reclaim we have
 B.elow = B.low
 C.elow = C.low

 For the global reclaim
 A.elow = A.low
 B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
 C.elow = min(C.usage, C.low)

 With the effective values resetting we have A reclaim
 A.elow = 0
 B.elow = B.low
 C.elow = C.low

 and global reclaim could see the above and then
 B.elow = C.elow = 0 because children_low_usage > A.elow

Which means that protected memcgs would get reclaimed.

In future we would like to make mem_cgroup_protected more robust against
racing reclaim contexts but that is likely more complex solution than this
simple workaround.
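
A sketch of the special-casing described (approximate, not the exact
upstream hunk):

	static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
							  struct mem_cgroup *memcg,
							  bool in_low_reclaim)
	{
		if (mem_cgroup_disabled())
			return 0;

		/* Targeted (limit) reclaim: the reclaim root gets no
		 * protection, and its stale effective values must not be
		 * used, as they may have been computed for a different root.
		 */
		if (root == memcg)
			return 0;

		if (in_low_reclaim)
			return READ_ONCE(memcg->memory.emin);

		return max(READ_ONCE(memcg->memory.emin),
			   READ_ONCE(memcg->memory.elow));
	}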

[hannes@cmpxchg.org - large part of the changelog]
[mhocko@suse.com - workaround explanation]
[chris@chrisdown.name - retitle]

Fixes: 9783aa9917 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26 08:36:22 -04:00
Andrey Zhizhikin 90c98361bb This is the 5.4.135 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD9WokACgkQONu9yGCS
 aT5RTg/+KOmvPPq4DTSRwQqC7Zk1TzPUQ38H2iZxgpISds7Y0S3RKFmJvXcRoxe2
 z0y6b1XErmVvamAlULFEYMxkmpwAiUeO137UqJN/kwyybvEejrAKDiv9kOMcEwh9
 zKPfrDQ9UQVbInSMsjQrzaME1voYzdUfhd10vGCxFjQl4RFRy06Fj0SfRmsZeeB+
 geu5F6xnba5+IW07okT4FTAsMYPqc+PyP/sENiXQPHt43uSNMQTRdLCh0+7slJ0b
 Lr9S/euozG8L3wYrs7AUFPaMLDvaQoh4k2mp5oXk8MYYrmKWrLo3e7ZNxBptxjd8
 NmwfG9WWfCp4LpN8fMnhrUQxkIj+paDTg9ir1bKmpJwm81miXlWazTQHCw1Mige1
 u03P9Q0tUQP3khpVSEE583RLjr8NKR/zkXx97KTL54GsFmwSe4QdbXX3ZlVYj4md
 FN/8MBBqITNOwm4akObRN4ppOCSD+Qp5a94JOXqmmZ36u+wicAB7SZgVZq6PAmXv
 kQEYxkS0EALLyzMuK5DBB5zcEq6oT/9Gtr107An1gFGj1hqd1NeV0xPguSxUJLE8
 GEL2M9s5jyjbqFZHiz3hPDMB5SKY0T6y8sGtKNmAM6woaLxoRp++JcR/U8m3PpD/
 wJ432zHfi6ERp9WsAhyiYpijMj+xU3gCeo8JIP5vsQnaFtvqev8=
 =qauz
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEKo44ACgkQ7G51OISz
 Hs0izw//RZp1Hn1xQTToi8PHof/qNviZESLMuhjtxXftG4bX1PZqvKDtBTYudo6+
 hsXyjHma/IFyRcNmzqookE1Fli5mrEm0FIdkyxfOTDur7JdRiTfDle7K2Gej0Maq
 DKuUO2qlXIxKZwe9YmPNKg+ZzFdlMmhdz6rCbAlumt859zErGK/1YLTqDZL4aiGS
 RZ43eY2BisU23JHbfIyVdvT4xdgL7vB4uadC7WIoM1WXTH/sv6VPd3rIC7oeAGBR
 q5/D1yfWvV6uyyX60WJnRH2vEUwv35UdNQkrIiFQ7SzonhbJbkE+ZL481g2IfZ3S
 OwdA2GMn/LE8+Q+IHtoISnUiyw8n7Lae69COHxUIcmggjIGSw5S5Bqoc4OuVw2dv
 BHICUux3IYwhHNv5Py3CNKiVLg9tKAvFoScrwofV5ToD/pgEBBjtbB0+OIoXtdMp
 yQVo/CKuiwwIDTrU1FpVC4rt90gS7EErpjOr/QG8paXMHiMxyhAPnBGLr9SPaueD
 LTXI3ZWNz+ZOFBLH34LZOMdyuWGNbQjwvi86Z5DuCaFL4ZXGAWVl5OUVF7oUyGkL
 vtgXfh6nzrsVoTBC7tfsuXuFossrTSvlpPtj2t2SB9hQEohE0pL6mS41inuJa4gP
 b6b0XRtazzskKT4ApEOoaNqlu0ZnDxC/xTdZN9nC5IZ/Mp+BrPQ=
 =603y
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.135' into 5.4-2.3.x-imx

This is the 5.4.135 stable release

Conflicts (manual resolve):
- drivers/usb/cdns3/gadget.c:
Use NXP version, as upstream commit f53729b828 ("usb: cdns3: Enable
TDL_CHK only for OUT ep") is already applied.

- arch/arm64/boot/dts/freescale/imx8mq.dtsi:
Merge upstream commit 556cf02830 ("arm64: dts: imx8mq: assign PCIe
clocks") manually into NXP tree.

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-04 14:25:42 +00:00
Nanyong Sun 8a85afc662 mm: slab: fix kmem_cache_create failed when sysfs node not destroyed
Commit d38a2b7a9c93 ("mm: memcg/slab: fix memory leak at non-root
kmem_cache destroy") introduced a problem: if one thread destroys a
kmem_cache A and another thread concurrently creates a kmem_cache B,
which is mergeable with A and has the same size as A, then B may fail to
be created due to the duplicate sysfs node.
The scenario in detail:
1) Thread 1 uses kmem_cache_destroy() to destroy kmem_cache A, which is
mergeable.  It decreases A's refcount and, if the refcount reaches 0,
calls memcg_set_kmem_cache_dying(), which sets A->memcg_params.dying =
true, then unlocks the slab_mutex and calls flush_memcg_workqueue(),
which may take a while.
Note: at this point the sysfs node of A (like '/kernel/slab/:0000248') is
still present; it will be deleted in shutdown_cache(), which is called
after flush_memcg_workqueue() is done and the slab_mutex is locked again.

2) Now thread 2 comes along and uses kmem_cache_create() to create B,
which is mergeable with A (their sizes are the same).  It takes the
slab_mutex and then calls __kmem_cache_alias(), trying to find a
mergeable cache; but because of the code added below in commit
d38a2b7a9c93 ("mm: memcg/slab: fix memory leak at non-root kmem_cache
destroy"), B is not mergeable with A, whose memcg_params.dying is true.

int slab_unmergeable(struct kmem_cache *s)
{
	if (s->refcount < 0)
		return 1;

	/*
	 * Skip the dying kmem_cache.
	 */
	if (s->memcg_params.dying)
		return 1;

	return 0;
}

So B has to create its own sysfs node by calling:
 create_cache->
	__kmem_cache_create->
		sysfs_slab_add->
			kobject_init_and_add
Because B is itself mergeable, the filename of its sysfs node is based on
its size, like '/kernel/slab/:0000248', which duplicates A's; and since
the sysfs node of A is still present, kobject_init_and_add() fails and
kmem_cache_create() fails as a result.

Concurrently modprobing and rmmoding the two modules below can reproduce
the issue quickly: nf_conntrack_expect, se_sess_cache.  See the call
trace at the end.

The LTS versions v4.19.y and v5.4.y have this problem, whereas kernels
after v5.9 do not, because the patchset ("The new cgroup slab memory
controller") largely refactored the memcg slab code.

A potential solution (the one this patch takes): just let the dying
kmem_cache be mergeable; the slab_mutex lock can prevent the race between
the thread creating the alias kmem_cache and the thread destroying the
root kmem_cache.  In the destroying thread, after flush_memcg_workqueue()
is done, check the refcount again; if someone took a reference during the
unlocked window, we don't need to destroy the kmem_cache completely, we
can reuse it.

Another potential solution: revert commit d38a2b7a9c93 ("mm: memcg/slab:
fix memory leak at non-root kmem_cache destroy"); compared to the failure
of kmem_cache_create(), the memory leak in this special scenario seems
less harmful.

Call trace:
 sysfs: cannot create duplicate filename '/kernel/slab/:0000248'
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  dump_backtrace+0x0/0x198
  show_stack+0x24/0x30
  dump_stack+0xb0/0x100
  sysfs_warn_dup+0x6c/0x88
  sysfs_create_dir_ns+0x104/0x120
  kobject_add_internal+0xd0/0x378
  kobject_init_and_add+0x90/0xd8
  sysfs_slab_add+0x16c/0x2d0
  __kmem_cache_create+0x16c/0x1d8
  create_cache+0xbc/0x1f8
  kmem_cache_create_usercopy+0x1a0/0x230
  kmem_cache_create+0x50/0x68
  init_se_kmem_caches+0x38/0x258 [target_core_mod]
  target_core_init_configfs+0x8c/0x390 [target_core_mod]
  do_one_initcall+0x54/0x230
  do_init_module+0x64/0x1ec
  load_module+0x150c/0x16f0
  __se_sys_finit_module+0xf0/0x108
  __arm64_sys_finit_module+0x24/0x30
  el0_svc_common+0x80/0x1c0
  el0_svc_handler+0x78/0xe0
  el0_svc+0x10/0x260
 kobject_add_internal failed for :0000248 with -EEXIST, don't try to register things with the same name in the same directory.
 kmem_cache_create(se_sess_cache) failed with error -17
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  dump_backtrace+0x0/0x198
  show_stack+0x24/0x30
  dump_stack+0xb0/0x100
  kmem_cache_create_usercopy+0xa8/0x230
  kmem_cache_create+0x50/0x68
  init_se_kmem_caches+0x38/0x258 [target_core_mod]
  target_core_init_configfs+0x8c/0x390 [target_core_mod]
  do_one_initcall+0x54/0x230
  do_init_module+0x64/0x1ec
  load_module+0x150c/0x16f0
  __se_sys_finit_module+0xf0/0x108
  __arm64_sys_finit_module+0x24/0x30
  el0_svc_common+0x80/0x1c0
  el0_svc_handler+0x78/0xe0
  el0_svc+0x10/0x260

Fixes: d38a2b7a9c93 ("mm: memcg/slab: fix memory leak at non-root kmem_cache destroy")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-25 14:35:14 +02:00
Andrey Zhizhikin e9646ca701 This is the 5.4.132 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDu+p0ACgkQONu9yGCS
 aT5SOw/9F58e4gz7PSTn4A9oCTNodRPe9B9rzf3y1Ol0k7T1aeQoWsPFOkZpNSOJ
 tdOGEXnwYnLpMC7nuFshWv1uKGAL/weHADyGV6J37AntYFjpEFhJhSH7pGGhDk7V
 EeIl98luBynPXOKNnDvcrQweeRaHKOInQBT8JJzwwsZbF2oqfOqdU0A787BiRu+3
 zoi/mV0upDB443ji/JY0xj+o4jlbsuD0WxEqgkcD2YHL+QvU5Wr0mGys7m5gG9x7
 TpKpMic0ILrF1vt/znLL5rOlX497prTvZ74ZXV/DYizeYxqtl/UG3CZjo1uf2yqk
 pAXA57paz6DY2Ct+3QbJBeuer27bTz6SCClSS1om9AcUk6oNSdULmMdTGvQb0SLU
 wx1Cy8b2ei04SVl96+McKKZ6ln47LJediGn0qIdwC6O/XHHrLq4u5PkSnQxRU4pA
 GH1tP5oYy4GzL9RbBeiDJQETFiXwkexSEWVyuSc6BhqQXao9yVzmLQbL1zgjH/zO
 m/tckZ3vEg+ll8j4QJCisHRyqYhwfru4PsJQH9Q7q6CtIuGOsd0Z/OUcLuF6knXg
 jDOrDIykE/PnkQ2Dc2RhdONP1ud5j3oBnHvNHs6FDghRKjaixMQzg3g/RNtnAaTj
 +7Xsfbi6ntpZSDOaY7YNgt+ZH3l4YRnUL/xBA6qIygayz374nzI=
 =LU0G
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmD25ikACgkQ7G51OISz
 Hs1dcw/8DKef1hGC5O2WKfpInTYtgnClkyD5/yOnGAPMvMRDGybA3dRejpIEefNM
 Qol1XICjb0wdBDV0I+n+fnGbBgaX3g0N/pn16pdbbSPBBe1L+d97gZNZznDGHYZu
 033qtbxii8e0QTxTvO7nx4L80ZZsyPLchpPxowS/vd1Ezti+pTIU4y43MCc2jYLL
 KqUBDz72TkPLhgVZdDJ1z9gb+OoJ+sJPaeBrO57hpY/os9SxlMPeY56YrD3Hyfy5
 IHZw3bTCDiIXpHBaJG8fvuudaM5M8V3dbD6oXnEPo1Gzb1Y7WR4Z7q28g7arSYjP
 fMPd243mCXd1V7LpmapxXvFsnbdsA7oauTho50dwmEvxQf9jEgX6thBWAFsrItaS
 crHdOppS7Lc3FK8cTMxZd6ZyZpaU6sF183tMOteuhtwmF/uoy1LBHqLnAvtfWYrl
 InGcImgABRkiYBRyODlgC4UNLd49Svon/8HcbBZlmeGIkosXjo5r1itnipgnF/TB
 /NkHRkixYTBCnJZyx+9Lihqw+HMnHVfjOnIBjbXjzX9ITH/tiMn4y87E+x9vRQqr
 Td5AKJwiSXSWZBQoX+XNLqXRwjZKHVQe45J4gzL9dhCzi9bwK99BvBPWr8+JyI7w
 83YQfkhPju47+KFrEN6DUBxdYrROsJLsjgdTl38IlCi4SKoQSkE=
 =Li9I
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.132' into 5.4-2.3.x-imx

This is the 5.4.132 stable release

Conflicts (manual resolve):
- drivers/gpu/drm/rockchip/cdn-dp-core.c:
Fix merge hiccup when integrating upstream commit 450c25b8a4
("drm/rockchip: cdn-dp-core: add missing clk_disable_unprepare() on
error in cdn_dp_grf_write()")

- drivers/perf/fsl_imx8_ddr_perf.c:
Port upstream commit 3fea9b708a ("drivers/perf: fix the missed
ida_simple_remove() in ddr_perf_probe()") manually to NXP version.

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-07-20 15:04:13 +00:00
Miaohe Lin 49496327c2 mm/z3fold: fix potential memory leak in z3fold_destroy_pool()
[ Upstream commit dac0d1cfda56472378d330b1b76b9973557a7b1d ]

There is a memory leak in z3fold_destroy_pool() as it forgets to
free_percpu pool->unbuddied.  Call free_percpu for pool->unbuddied to fix
this issue.
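
A sketch of the fix described (approximate; other teardown in
z3fold_destroy_pool() elided):

	static void z3fold_destroy_pool(struct z3fold_pool *pool)
	{
		kmem_cache_destroy(pool->c_handle);
		destroy_workqueue(pool->compact_wq);
		destroy_workqueue(pool->release_wq);
		free_percpu(pool->unbuddied);	/* previously leaked */
		kfree(pool);
	}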

Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com
Fixes: d30561c56f ("z3fold: use per-cpu unbuddied lists")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:53:47 +02:00
Miaohe Lin 4b515fa948 mm/huge_memory.c: don't discard hugepage if other processes are mapping it
[ Upstream commit babbbdd08af98a59089334eb3effbed5a7a0cf7f ]

If other processes are mapping any other subpages of the hugepage, i.e.
in the pte-mapped THP case, page_mapcount() will return 1 incorrectly.  Then
we would discard the page while other processes are still mapping it.  Fix
it by using total_mapcount() which can tell whether other processes are
still mapping it.
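
Illustratively (a sketch of the check in madvise_free_huge_pmd(), hedged
rather than the literal hunk):

	/* Only discard the huge page if nobody else maps any part of it; a
	 * pte-mapped subpage elsewhere is missed by page_mapcount(page).
	 */
	if (total_mapcount(page) != 1)	/* was: page_mapcount(page) != 1 */
		goto out;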

Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
Fixes: b8d3c4c300 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:53:47 +02:00
Andrey Zhizhikin 3aef3ef268 Linux 5.4.129
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcbxkACgkQ3qZv95d3
 LNxZMBAArNPLhVYdEDDFosb6Y/5RGjjZ/79OGHH0p5YiTo8D+wBHi+wXRl5Jp0PA
 3YVVU8lDTbeDm7E7uWeduWjFwEpsPBL8395scbhC6VR3PfnyunjarVXZgi6EHnMl
 p6HjXXtQ1jTrdDSziGDIhZVQT5FGb2/MMx9m69mfi5BTLjGfWy8chHFbC2GZszlp
 Znu9syjisUBbc4I4XHFgXw0hoQSSig6SUTZCrdTpIW/PZ0swfl8ZPxREh0CZNMpw
 Y2orRt+oHlkWPw1/sSkoTE1PRvXwNWFXyw5caOu846jAfhKtxO54SsqJqhM7VLHZ
 pdH4eb6q7AFyt0A62HkIqa5oabs5Vk9G24b8m5ggc2F/UTkHqgwUcMCud0d3DYL0
 Q7OEAmThQzHHKJ+CeNRJLsiKqVBNHmeS24B+ELldlAiX22vLr9pUsIb342Au1ZjR
 S3BTnneAbYGBv4qUoV2yUF9wQ/LxsFMSl/vmjCBOxg7c3LbKYChUwskYnvd6EwWj
 ObCyLU6FK9HWXSBSp/X+irlF1CLla+HuOC+Aej2U5a8DtmHId4LHMeq/XOxZ9s/8
 QUoX4rh5P+TJ8PIiTqXKrQo5rnR79MiYssIhUozKTdt9ZoMtXzI4mVLXN/yzAVD9
 v4aWYx8m2x17Wq+ptaLMSTSed4m3c25uEl4MucLBmKQV8ClAxW8=
 =Sijo
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDchOoACgkQ7G51OISz
 Hs3s9Q/+OfeIcTzOY6qDX53PSxK5iF3xZEiPR5HOSVu1cY4+1tyt0bkYdQJ2N//d
 NH6YCuOsVI5chRxKN8b6I01U6xxQ0VNvj5RYmuC0AqhhzJBpfhh6FrrGuVxP+dCj
 aa2twfRZO+Y0yzrIYdSvYdRhJlEhMbKGNyHu871tyQQTlp0eQKy9lBhcCuVSCMso
 p64lRlK35GDi6TaWQ7R0SYOziWWSHByf4p3h4ZlpyhB7w6yMNjNrts4qSoZb2uOY
 I453WroS7tWV3qkOZ9FeqjHyg4mjBg87/wSMZ0DbBANinlQZW5YTOjE/WyPossL3
 C4jTP8NBLsng+ZfhAaMRfbvxGj5gbY0ghSOmsuglAOQDd+tlTfB23pkgHFlj6aie
 FUplJnTLd8VXPWkoNVKH6GfG2rZAUtQKM4EWDPt2KK3Dch9W3YH1bHL9EyxR0G7f
 aDBHM9egTyPOYNcfeVDj0KEJTrOxkmCcYz83D+eEDGKHi//Gcb+k9Gsg6q8X6Yxa
 ejFx5RBM5SW57LA70ZK3+HWhqfqcILYJT7Lxoue7bH+15vSGnTkmJKS80OUrP8w8
 y9gQ9WNsmjQUHvtKPAIZUZYlUS05FhtQ+faC8XyrLnfZj+ZCGBZor/KsNe0SP/UN
 bAYudfCn0YkjGfTNcljOe4oAnwmnVKAjGlQkHOFw+llmM0czEBk=
 =vpHo
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.129' into 5.4-2.3.x-imx

Linux 5.4.129

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-06-30 14:51:20 +00:00
Hugh Dickins 61168eafe0 mm, futex: fix shared futex pgoff on shmem huge page
[ Upstream commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 ]

If more than one futex is placed on a shmem huge page, it can happen
that waking the second wakes the first instead, and leaves the second
waiting: the key's shared.pgoff is wrong.

When the 3.11 commit 13d60f4b6a ("futex: Take hugepages into account when
generating futex_key") went in, the only shared huge pages came from
hugetlbfs, and the code added to deal with its exceptional page->index was
put into the hugetlb source.  Then that was missed when 4.8 added shmem huge pages.

page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails.  Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.

Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
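
A sketch of the resulting page_to_pgoff() shape (approximate):

	static inline pgoff_t page_to_pgoff(struct page *page)
	{
		struct page *head;

		if (unlikely(PageHuge(page)))
			/* hugetlbfs: correct for head and tail pages */
			return hugetlb_basepage_index(page);

		head = compound_head(page);
		/* works for base pages (head == page) and for THP tails */
		return head->index + (page - head);
	}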

[akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]

Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Zhang Yi <wetpzy@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: leave redundant #include <linux/hugetlb.h>
in kernel/futex.c, to avoid conflict over the header files included.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:55 -04:00
Hugh Dickins a33b70d625 mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.

Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Fixes: ace71a19ce ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Wang Yugui <wangyugui@e16-tech.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:55 -04:00
Hugh Dickins e045e9e79d mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.

Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next.

Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before.  Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together, but
folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.
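
The helper mentioned is roughly (sketch):

	static void step_forward(struct page_vma_mapped_walk *pvmw,
				 unsigned long size)
	{
		/* Advance to the next size-aligned boundary */
		pvmw->address = (pvmw->address + size) & ~(size - 1);
		/* Stop cleanly if we wrapped at the end of the address space */
		if (!pvmw->address)
			pvmw->address = ULONG_MAX;
	}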

Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
Fixes: ace71a19ce ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins 037a1d67d2 mm: page_vma_mapped_walk(): get vma_address_end() earlier
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the
start, rather than later at next_pte.

It's a little unnecessary overhead on the first call, but makes for a
simpler loop in the following commit.

Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins fa89d53694 mm: page_vma_mapped_walk(): use goto instead of while (1)
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
and use "goto this_pte", in place of the "while (1)" loop at the end.

Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins a499febd99 mm: page_vma_mapped_walk(): add a level of indentation
page_vma_mapped_walk() cleanup: add a level of indentation to much of
the body, making no functional change in this commit, but reducing the
later diff when this is all converted to a loop.

[hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix]
  Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com

Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins b1783bf8c8 mm: page_vma_mapped_walk(): crossing page table boundary
page_vma_mapped_walk() cleanup: adjust the test for crossing page table
boundary - I believe pvmw->address is always page-aligned, but nothing
else here assumed that; and remember to reset pvmw->pte to NULL after
unmapping the page table, though I never saw any bug from that.

Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins 80b2270a14 mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
follow the same "return not_found, return not_found, return true"
pattern as the block above it (note: returning not_found there is never
premature, since existence or prior existence of huge pmd guarantees
good alignment).

Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:54 -04:00
Hugh Dickins ef161ccaca mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
use it in subsequent tests, instead of repeatedly dereferencing pointer.

Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Hugh Dickins 4961160272 mm: page_vma_mapped_walk(): settle PageHuge on entry
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of
the way at the start, so no need to worry about it later.

Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Hugh Dickins 52e2b20fb5 mm: page_vma_mapped_walk(): use page for pvmw->page
Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".

I've marked all of these for stable: many are merely cleanups, but I
think they are much better before the main fix than after.

This patch (of 11):

page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page
was used, sometimes pvmw->page itself: use the local copy "page"
throughout.

Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Yang Shi 82ee7326af mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
[ Upstream commit 504e070dc08f757bccaed6d05c0f53ecbfac8a23 ]

When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() only checks page_mapcount(); it may
therefore miss the failure when the head page is unmapped but another
subpage is still mapped.  The second DEBUG_VM BUG(), which checks the
total mapcount, would then catch it.  This can cause some confusion.

As this is not a fatal issue, consolidate the two DEBUG_VM checks
into one VM_WARN_ON_ONCE_PAGE().
[1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
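
Conceptually, the consolidated check in unmap_page() looks roughly like
this (sketch, not the literal hunk):

	try_to_unmap(page, ttu_flags);

	/* One non-fatal warning instead of two DEBUG_VM BUG paths */
	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);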

Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: fixed up variables in split_huge_page_to_list(),
and fixed up the conflict on ttu_flags in unmap_page().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Hugh Dickins bd43892152 mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
[ Upstream commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc ]

There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
it, but before its page_remove_rmap() gets to decrement
compound_mapcount: generating false "BUG: Bad page cache" reports that
the page is still mapped when deleted.  This commit fixes that, but not
in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has
often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but there apply it to a single locked page.
try_to_unmap() looks more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this
case, and does spin_unlock(pmd_lock) if so - just like
page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
safe.
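
Roughly the shape of the new helper (a simplified sketch; the stable
backport uses hpage_nr_pages(), as noted below):

  void unmap_mapping_page(struct page *page)
  {
          struct address_space *mapping = page->mapping;
          struct zap_details details = { };

          VM_BUG_ON(!PageLocked(page));
          VM_BUG_ON(PageTail(page));

          details.check_mapping = mapping;
          details.first_index = page->index;
          details.last_index = page->index + hpage_nr_pages(page) - 1;
          details.single_page = page;    /* the addition to details */

          i_mmap_lock_write(mapping);
          if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                  unmap_mapping_range_tree(&mapping->i_mmap, &details);
          i_mmap_unlock_write(mapping);
  }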

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.

Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: fixed up call to truncate_cleanup_page()
in truncate_inode_pages_range().  Use hpage_nr_pages() in
unmap_mapping_page().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Jue Wang b767134ec3 mm/thp: fix page_address_in_vma() on file THP tails
commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 upstream.

Anon THP tails were already supported, but memory-failure may need to
use page_address_in_vma() on file THP tails, which its page->mapping
check did not permit: fix it.

hughd adds: no current usage is known to hit the issue, but this does
fix a subtle trap in a general helper: best fixed in stable sooner than
later.

Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:53 -04:00
Hugh Dickins 41432a8a67 mm/thp: fix vma_address() if virtual address below file offset
[ Upstream commit 494334e43c16d63b878536a26505397fce6ff3a2 ]

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for
try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low, but vm_pgoff not so low:
vma_address() chooses max(start, vma->vm_start), but that decides on the
wrong address, because start has become almost ULONG_MAX.
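
A tiny user-space model of that wraparound (variable names invented for
illustration; not kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned long vm_start = 0x1000;   /* unusually low vm_start */
          unsigned long vm_pgoff = 0x100;    /* but vm_pgoff not so low */
          unsigned long page_pgoff = 0x10;   /* page sits below the range */

          /* vma_address(): vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT) */
          unsigned long start = vm_start + ((page_pgoff - vm_pgoff) << 12);
          unsigned long chosen = start > vm_start ? start : vm_start;

          printf("start  = %#lx (wrapped to near ULONG_MAX)\n", start);
          printf("chosen = %#lx (max() picks the wrapped value)\n", chosen);
          return 0;
  }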

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.

Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

An irritation is that their apparent generality breaks down on KSM
pages, which cannot be located by the page->index that page_to_pgoff()
uses: as commit 4b0ece6fa0 ("mm: migrate: fix remove_migration_pte()
for ksm pages") once discovered.  I dithered over the best thing to do
about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
vma_address() and vma_address_end(); though the only place in danger of
using it on them was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never
mattered.

Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: fixed up conflicts on intervening thp_size().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Hugh Dickins 4b0a34e222 mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
[ Upstream commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 ]

Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e.  mapcount 0.

And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.

I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.

Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().
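
The core of the change is small; roughly (a sketch, not the exact hunk):

  /* In try_to_unmap_one(), before walking the page tables: */
  if (flags & TTU_SYNC)
          pvmw.flags = PVMW_SYNC;    /* spin on, rather than skip, a
                                        racing pte/pmd clearer */

  /* And unmap_page() adds TTU_SYNC to the flags it passes down: */
  ttu_flags |= TTU_SYNC;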

When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem.  Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.

Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: upstream TTU_SYNC 0x10 takes the value which
5.11 commit 013339df116c ("mm/rmap: always do TTU_IGNORE_ACCESS") freed.
It is very tempting to backport that commit (as 5.10 already did) and
make no change here; but on reflection, good as that commit is, I'm
reluctant to include any possible side-effect of it in this series.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Hugh Dickins bd092a0f19 mm/thp: make is_huge_zero_pmd() safe and quicker
commit 3b77e8c8cde581dadab9a0f1543a347e24315f11 upstream.

Most callers of is_huge_zero_pmd() supply a pmd already verified
present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
migration entry, in which the pfn is encoded differently from a present
pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
since L1TF forced us to protect against that); or perhaps even crash in
pmd_page() applied to a swap-like entry.

Make it safe by adding pmd_present() check into is_huge_zero_pmd()
itself; and make it quicker by saving huge_zero_pfn, so that
is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
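
Roughly the resulting helper (a sketch, assuming huge_zero_pfn is cached
when the huge zero page is allocated):

  static inline bool is_huge_zero_pmd(pmd_t pmd)
  {
          /* pmd_present() first: a migration entry encodes its pfn
           * differently and must never match; the cached pfn avoids
           * the pmd_page() lookup on every call. */
          return pmd_present(pmd) &&
                 pmd_pfn(pmd) == READ_ONCE(huge_zero_pfn);
  }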

__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
but is unnecessary now that is_huge_zero_pmd() checks present.

Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Hugh Dickins 4c37d7f269 mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
[ Upstream commit 99fa8a48203d62b3743d866fc48ef6abaee682be ]

Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.

Here is v2 batch of long-standing THP bug fixes that I had not got
around to sending before, but prompted now by Wang Yugui's report
https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
they have done no harm, but have *not* fixed that issue: something more
is needed and I have no idea of what.

This patch (of 7):

Stressing huge tmpfs page migration racing hole punch often crashed on
the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
kernel; or shortly afterwards, on a bad dereference in
__split_huge_pmd_locked() when DEBUG_VM=n.  They forgot to allow for pmd
migration entries in the non-anonymous case.

Full disclosure: those particular experiments were on a kernel with more
relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
vanilla kernel: it is conceivable that stricter locking happens to avoid
those cases, or makes them less likely; but __split_huge_pmd_locked()
already allowed for pmd migration entries when handling anonymous THPs,
so this commit brings the shmem and file THP handling into line.

And while there: use old_pmd rather than _pmd, as in the following
blocks; and make it clearer to the eye that the !vma_is_anonymous()
block is self-contained, making an early return after accounting for
unmapping.
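
A compressed sketch of the reworked !vma_is_anonymous() block (details
such as DAX and deposited-pgtable handling are elided; not the exact
diff):

  if (!vma_is_anonymous(vma)) {
          old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
          /* ... dax / deposited page table handling ... */
          if (unlikely(is_pmd_migration_entry(old_pmd))) {
                  /* shmem/file THP may be under migration: the entry
                   * is swap-like, not a present pmd */
                  swp_entry_t entry = pmd_to_swp_entry(old_pmd);

                  page = migration_entry_to_page(entry);
          } else {
                  page = pmd_page(old_pmd);
                  if (!PageDirty(page) && pmd_dirty(old_pmd))
                          set_page_dirty(page);
                  if (!PageReferenced(page) && pmd_young(old_pmd))
                          SetPageReferenced(page);
                  page_remove_rmap(page, true);
                  put_page(page);
          }
          add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
          return;    /* self-contained: early return after accounting */
  }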

Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: this commit made intervening cleanups in
pmdp_huge_clear_flush() redundant: here it's rediffed to skip them.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Xu Yu 7ce4b73d34 mm, thp: use head page in __migration_entry_wait()
commit ffc90cbb2970ab88b66ea51dd580469eede57b67 upstream.

We noticed that a hung task happens in a corner-case but practical
scenario when CONFIG_PREEMPT_NONE is enabled, as follows.

Process 0                       Process 1                     Process 2..Inf
split_huge_page_to_list
    unmap_page
        split_huge_pmd_address
                                __migration_entry_wait(head)
                                                              __migration_entry_wait(tail)
    remap_page (roll back)
        remove_migration_ptes
            rmap_walk_anon
                cond_resched

Here __migration_entry_wait(tail) occurs in kernel space, e.g. via
copy_to_user in fstat, which immediately faults again without
rescheduling, and thus occupies the cpu fully.

When there are too many processes performing __migration_entry_wait on
a tail page, remap_page is never reached after cond_resched.

This makes __migration_entry_wait operate on the compound head page,
so it waits for remap_page to complete, whether the THP is split
successfully or rolled back.
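
The change itself is tiny; roughly (a sketch of __migration_entry_wait(),
not the exact hunk):

  page = migration_entry_to_page(entry);
  page = compound_head(page);    /* tail faults wait on the head's lock */

  if (!get_page_unless_zero(page))
          goto out;
  pte_unmap_unlock(ptep, ptl);
  put_and_wait_on_page_locked(page);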

Note that put_and_wait_on_page_locked helps to drop the page reference
acquired with get_page_unless_zero, as soon as the page is on the wait
queue, before actually waiting.  So splitting the THP is only prevented
for a brief interval.

Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
Fixes: ba98828088 ("thp: add option to setup migration entries during PMD split")
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Miaohe Lin 68ce37ebe0 mm/rmap: use page_not_mapped in try_to_unmap()
[ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ]

page_mapcount_is_zero() calculates accurately how many mappings a hugepage
has in order to check against 0 only.  This is a waste of CPU time.  We
can do this via page_not_mapped() to save some possible atomic_read
cycles.  Remove the function page_mapcount_is_zero() as it's not used
anymore, and move page_not_mapped() above try_to_unmap() to avoid an
"identifier undeclared" compilation error.
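
The replacement predicate is trivial; roughly (a sketch):

  static int page_not_mapped(struct page *page)
  {
          return !page_mapped(page);    /* no full mapcount calculation */
  }

  /* ... and try_to_unmap()'s rmap_walk_control now uses it: */
  struct rmap_walk_control rwc = {
          .rmap_one = try_to_unmap_one,
          .arg = (void *)flags,
          .done = page_not_mapped,
          .anon_lock = page_lock_anon_vma_read,
  };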

Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:52 -04:00
Miaohe Lin 432b61863a mm/rmap: remove unneeded semicolon in page_not_mapped()
[ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ]

Remove extra semicolon without any functional change intended.

Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:51 -04:00
Andrey Zhizhikin db8ff65069 This is the 5.4.128 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDTLA4ACgkQONu9yGCS
 aT45xg/+IvxFaIOtutEBkFCJvEurRWSozjBKAfX9xtJQGSSKVyDvh7GZWfEXMxZc
 oNf8DWQKvaiZj2mRdgYp6Ilo27Ps6aN3vCo09z+U3mfGQLMbNpPYEvSq6Twl26NB
 8lL8b++0Jo7P+eOALHohBS125/E0etqhoc2HXDFp6pfksj6J7klxlyQ2NX9Ih8xm
 l7Cto5flCHM9g20/CNsqxXPWiuBKnzSvp9YH9HMDgjOV6YSktLGTHAJ8omjPm0V/
 pQVFOo4Kyx34exdA/IzrM/yV4iDThVtwL6+bNErWtl6LwiIcNK3esARYTNjbBBhK
 W156adxp6kl6LqMADr/y77WqvcH6H2PhpRnMj+6t21FpK7cTbXfqvxBfpOvE1Buh
 in95LJN1Iins1PTozBVHcUIpdESO5AN8/2aHq0LRLmVbaLlo6aj+sjdHNPvf7HwW
 8LDHtpGNao/spMuZmvvH+6i3iwuciINCRY9TVBDgkT5LhWhRHBl6+uSLEX/d+s3Z
 663Q6HPu+cfubR7UC8+QsMMtf7KD2yvQuadAz6n/Z41vvSYIUHPGsYtZUmsef3jP
 n4CTAmGavtyR5jaQNkuw8nnIn7cthONw94foFheBH0doxmkXPKcwqmWO9DH77n58
 unMT31ArVg9ObrO/YmLjEaV9X7VlfRf6yw7tey1RJXgrSD3nwgk=
 =9+GF
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDTTjIACgkQ7G51OISz
 Hs1mJw/5AaIlIv+kMC0qP9aG5Y1uoUQ+fewr4Ofo9Kt8QlVcogdeM16URF7O2CDG
 mICWVXRqY+xWl/84XWJs11RRxfMnXp85myi1jyalOFDYojJp2qOkOyEi8zMg7wui
 N1kkPxEworm6ILrZRKNgT36ZfSEGAD2ogkp4wS/Fldwh2RNHtGUSw3xAPcqg0yDq
 +Ve3V06MDul/dyRup3MYtQ/Kxb5wCc+Lfwh+eXYZIu/DfXy3E4K3w8YB7NPazlTw
 dBlWU/S3Y0uyoq+DvfFQ+nL0GTe0ezGQB5n5n1eNRWg+mEn3IiWXkrBJfha2Swcl
 vv+Fy/5ro9elcn1oQF5uvwDoGHhq/KkR7TjRICY9zmVvJ79IcqZJ/Bdg5CEXM+xc
 EvZDAdXn26y2xKZJzKGQSe7QJurOuQpdW497h1DLGIhDs2sGDjhXuISPcNpjPJKG
 PdgpJ9kG4v2cYnW58UTdzQKNFg7jjFUiiQV/IIoiVlxhg7dVQnSKCgXHcngiZiuQ
 eUR9p7ahfEOlG2N5qTlJqom0HtiJ6I4a5TDQhXoEYlKfxr9711f2+cramhf2cle+
 +G64chvmKcv7Fyx5yIwSJrTJfi6vsuW5VMfXrvGIqjFUEjIGe9rQMhWH34A7s5tG
 RD0Tg90JJr0rNsNa+LZNPbRynbdPBbrVWdrVyXN+9HDs40oe8BY=
 =f7qZ
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.128' into 5.4-2.3.x-imx

This is the 5.4.128 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-06-23 15:07:27 +00:00
Andrew Morton 4f69c89306 mm/slub.c: include swab.h
commit 1b3865d016815cbd69a1879ca1c8a8901fda1072 upstream.

Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y.

Hopefully.  But it's the right thing to do anyway.

Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417
Reported-by: <vannguye@cisco.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23 14:41:30 +02:00
Kees Cook 9ddeea35c4 mm/slub: fix redzoning for small allocations
commit 74c1d3e081533825f2611e46edea1fcdc0701985 upstream.

The redzone area for SLUB exists between s->object_size and s->inuse
(which is at least the word-aligned object_size).  If a cache were
created with an object_size smaller than sizeof(void *), the in-object
stored freelist pointer would overwrite the redzone (e.g.  with boot
param "slub_debug=ZF"):

  BUG test (Tainted: G    B            ): Right Redzone overwritten
  -----------------------------------------------------------------------------

  INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
  INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
  INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620

  Redzone  (____ptrval____): bb bb bb bb bb bb bb bb    ........
  Object   (____ptrval____): f6 f4 a5 40 1d e8          ...@..
  Redzone  (____ptrval____): 1a aa                      ..
  Padding  (____ptrval____): 00 00 00 00 00 00 00 00    ........

Store the freelist pointer out of line when object_size is smaller than
sizeof(void *) and redzoning is enabled.
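
Roughly where that lands in calculate_sizes() (a simplified sketch of the
condition; the real check also covers RCU-freed caches):

  /* Move the freelist pointer out of line whenever it cannot live
   * inside the object: poisoning, a constructor, or now a right
   * redzone with an object smaller than a pointer. */
  if ((flags & SLAB_POISON) || s->ctor ||
      ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *))) {
          s->offset = size;    /* freepointer stored after the object */
          size += sizeof(void *);
  }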

Additionally remove the "smaller than sizeof(void *)" check under
CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
SLAB and SLOB both handle small sizes.

(Note that no caches within this size range are known to exist in the
kernel currently.)

Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
Fixes: 81819f0fc8 ("SLUB core")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Lin, Zhenpeng" <zplin@psu.edu>
Cc: Marco Elver <elver@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23 14:41:30 +02:00
Kees Cook c0837e021d mm/slub: clarify verification reporting
commit 8669dbab2ae56085c128894b181c2aa50f97e368 upstream.

Patch series "Actually fix freelist pointer vs redzoning", v4.

This fixes redzoning vs the freelist pointer (both for middle-position
and very small caches).  Both are "theoretical" fixes, in that I see no
evidence of such small-sized caches actually be used in the kernel, but
that's no reason to let the bugs continue to exist, especially since
people doing local development keep tripping over it.  :)

This patch (of 3):

Instead of repeating "Redzone" and "Poison", clarify which sides of
those zones got tripped.  Additionally fix column alignment in the
trailer.

Before:

  BUG test (Tainted: G    B            ): Redzone overwritten
  ...
  Redzone (____ptrval____): bb bb bb bb bb bb bb bb      ........
  Object (____ptrval____): f6 f4 a5 40 1d e8            ...@..
  Redzone (____ptrval____): 1a aa                        ..
  Padding (____ptrval____): 00 00 00 00 00 00 00 00      ........

After:

  BUG test (Tainted: G    B            ): Right Redzone overwritten
  ...
  Redzone  (____ptrval____): bb bb bb bb bb bb bb bb      ........
  Object   (____ptrval____): f6 f4 a5 40 1d e8            ...@..
  Redzone  (____ptrval____): 1a aa                        ..
  Padding  (____ptrval____): 00 00 00 00 00 00 00 00      ........

The earlier commits that slowly resulted in the "Before" reporting were:

  d86bd1bece ("mm/slub: support left redzone")
  ffc79d2880 ("slub: use print_hex_dump")
  2492268472 ("SLUB: change error reporting format to follow lockdep loosely")

Link: https://lkml.kernel.org/r/20210608183955.280836-1-keescook@chromium.org
Link: https://lkml.kernel.org/r/20210608183955.280836-2-keescook@chromium.org
Link: https://lore.kernel.org/lkml/cfdb11d7-fb8e-e578-c939-f7f5fb69a6bd@suse.cz/
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marco Elver <elver@google.com>
Cc: "Lin, Zhenpeng" <zplin@psu.edu>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23 14:41:30 +02:00
yangerkun 566345aaab mm/memory-failure: make sure wait for page writeback in memory_failure
[ Upstream commit e8675d291ac007e1c636870db880f837a9ea112a ]

Our syzkaller triggered the "BUG_ON(!list_empty(&inode->i_wb_list))" in
clear_inode:

  kernel BUG at fs/inode.c:519!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
  CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
  Hardware name: linux,dummy-virt (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO)
  pc : clear_inode+0x280/0x2a8
  lr : clear_inode+0x280/0x2a8
  Call trace:
    clear_inode+0x280/0x2a8
    ext4_clear_inode+0x38/0xe8
    ext4_free_inode+0x130/0xc68
    ext4_evict_inode+0xb20/0xcb8
    evict+0x1a8/0x3c0
    iput+0x344/0x460
    do_unlinkat+0x260/0x410
    __arm64_sys_unlinkat+0x6c/0xc0
    el0_svc_common+0xdc/0x3b0
    el0_svc_handler+0xf8/0x160
    el0_svc+0x10/0x218
  Kernel panic - not syncing: Fatal exception

A crash dump of this problem shows that someone called __munlock_pagevec
to clear page LRU without lock_page: do_mmap -> mmap_region -> do_munmap
-> munlock_vma_pages_range -> __munlock_pagevec.

As a result memory_failure will call identify_page_state without
wait_on_page_writeback.  And after truncate_error_page clears the mapping
of this page, end_page_writeback won't call sb_clear_inode_writeback to
clear inode->i_wb_list.  That will trigger the BUG_ON in clear_inode!

Fix it by also checking PageWriteback, to help determine whether we should
skip wait_on_page_writeback.
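
Roughly, the shortcut in memory_failure() that jumps straight to
identify_page_state() gains a PageWriteback() check (a sketch, not the
exact hunk):

  /* Don't bypass the wait_on_page_writeback() below for a page that
   * is still under writeback, even if it is no longer on the LRU. */
  if (!PageTransTail(p) && !PageLRU(p) && !PageWriteback(p))
          goto identify_page_state;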

Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
Fixes: 0bc1f8b068 ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-23 14:41:23 +02:00
Andrey Zhizhikin 276aedc8f1 This is the 5.4.125 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDB+Z8ACgkQONu9yGCS
 aT5qig//WVut449WUeYQLKD8rAB5CUVm2Xl3509Ts8W6LSzYGHiYv1SRVeH2y1lS
 QnfCnBciopl2UyYxqXGQwoRYdY1T2E/MWUmwGUk0/qlZYOzg5xQ368Shm0lvohJI
 DsywZrYqJDUCoeyXoWJYrq/3RiAvMK30teKDcn1A2HhhWdo0nsGLp1GUX396ptcV
 3xw2ZvCVwuikwxq5jlQKUEkH59TD/ZkCzvn9gfd86FY1R0ohApLJckhGIuT3wA1c
 Tfekgvfngx1HcEWIAzWFqZPoB8mOF5pn06yZhuPdMKa8UUq78ckN7kbchERj2wJD
 cDFSQQrMI3nL9sA8ryYV1YFl3fyGX5Epm4O465whzjKWoZ9HwN+iwl6Qv+kOmX41
 YUmpUplhsPN+I7+cX1jF7Ohw583uDbFPw6XbyZ0ArZr03JVVv4Vjrv5QA9fVHR06
 OP7+zEUlBtu/g3k0Bj5MU8UKem0shXavkPqukrtB+MhrXh2VngEXEVOvKMOFgA4b
 BnBEga4SrCR/wB+SucIV4fqzV0tq4HD/cPpy67OafrWoqhwlnBsMCQUd+puxkCnM
 y+eEoRwTzRSW+U9y8KdAERW8qSR/vCyKCUoaKxOV3Jj0v8xp0Y6VHKlKmb//w5Gn
 Lk7sNjD60Um3Au53A5pJvh8qNg+OsNc46sEmGGndE4Mrada93gE=
 =O2C+
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDCIf8ACgkQ7G51OISz
 Hs04IA/9HPtgSX+5Uha8T9IWUKxKwK8BwXAnBnowBkt76X50PLR7/i1wD3WmNdMc
 nVe+bScX6gTjhbqICO2toDZ+lcqWsM00cHPnjZGGwnGDFIvlbxYAYZt/dPTHxgze
 YXDu7dxY5Cb3tAYBX1Ng165Rti9gJC8QNGOLXiCOUhDSTNMepe02wi6bKR3jN/hm
 jjl02Qo9BQI70a1w3zOFHH8ffQuUdOoTFji8hq2u+cJ1tP6FuftJyPvIAm+MDLNd
 83dg1P4eg73Qk+tp93OKrSG3pnCngxgveCB+U3SQnCd2b83asNVigjxoxkrZZJ7L
 9kxq4ifyAfH9TLQJ5lo2xOdQ1ra0+KTBwYKr2X1/N5mrXnmi9OCt54tXFnkPcJLN
 S0HAP9cCf+NtoACirUfNETeZJDaISvHiYT8XhbJ+y1mr+3pbN/4DQ3P/4u1ykoyF
 XBQwuAEw6ljz22HbZBuLrsB339CSwVuJbSaFrmeuUsX7AKIA/p45lr6L5JQssTyD
 a0NWKWFMJ7rV/f10u/B24kZwcSNghx1xMuX8hfBnyPtbR4ChnlnKnSSLsF+AJQEW
 7chnIejPa0UwQAkZmd/b1qaqZq5oIct3BFRZUfcledlej8HweLNJGyy8+T5ZSu5d
 G5ku8CU2hIgnKSBaqF7AaqkcDga2fjelGJetJjqdPwgIzMJmqnE=
 =b+bL
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.125' into 5.4-2.3.x-imx

This is the 5.4.125 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-06-10 14:30:21 +00:00
Matthew Wilcox (Oracle) 3b7f3cab1d mm/filemap: fix storing to a THP shadow entry
commit 198b62f83eef1d605d70eca32759c92cdcc14175 upstream

When a THP is removed from the page cache by reclaim, we replace it with a
shadow entry that occupies all slots of the XArray previously occupied by
the THP.  If the user then accesses that page again, we only allocate a
single page, but storing it into the shadow entry replaces all entries
with that one page.  That leads to bugs like

page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
------------[ cut here ]------------
kernel BUG at mm/filemap.c:2529!

https://bugzilla.kernel.org/show_bug.cgi?id=206569

This is hard to reproduce with mainline, but happens regularly with the
THP patchset (as so many more THPs are created).  This solution is taken
from the THP patchset.  It splits the shadow entry into order-0 pieces at
the time that we bring a new page into cache.

Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20200903183029.14930-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10 13:37:15 +02:00
Mina Almasry 14fd3da3e8 mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
[ Upstream commit d84cf06e3dd8c5c5b547b5d8931015fc536678e5 ]

The userfaultfd hugetlb tests cause a resv_huge_pages underflow.  This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
an index for which we already have a page in the cache.  When this
happens, we allocate a second page, double consuming the reservation,
and then fail to insert the page into the cache and return -EEXIST.

To fix this, we first check if there is a page in the cache which
already consumed the reservation, and return -EEXIST immediately if so.
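
Sketch of the early check in hugetlb_mcopy_atomic_pte() (illustrative;
not the exact hunk):

  if (!*pagep) {
          /* If a page already sits in the cache for this index, its
           * reservation was already consumed: don't allocate again. */
          if (vm_shared &&
              hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
                  ret = -EEXIST;
                  goto out;
          }

          page = alloc_huge_page(dst_vma, dst_addr, 0);
          /* ... allocation failure handling and copy as before ... */
  }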

There is still a rare condition where we fail to copy the page contents
AND race with a call to hugetlb_no_page() for this index, and again we
will underflow resv_huge_pages.  That is fixed in a more complicated
patch not targeted for -stable.

Test:

  Hacked the code locally such that resv_huge_pages underflows produce a
  warning, then:

  ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
  ./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings.  After the test runs number
of free/resv hugepages is correct.

[mike.kravetz@oracle.com: changelog fixes]

Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com
Fixes: 8fb5debc5f ("userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-10 13:37:14 +02:00
Andrey Zhizhikin 2cbb55e591 This is the 5.4.120 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCkyEcACgkQONu9yGCS
 aT70Qg//Rv09McvLQ+8E0OilJ7TdT0UthXQFP+uPTu+/HPeHQkCO168cn1hbwD9K
 i0YfFYB7PqPe/wccHNsmWHSUYCzA9NnwExA84/jofjswkEMMc95x/bow5/xmLe/5
 ImkjODPVHuQWewgMfbSmNu7Br4wmQC5U/K4r7hp/Aa0FdTjcHMI6Zw40FGbJrWmq
 kiqhW9CeagKbxWrihQNLrSB4E5CdpNNkug/zVus2n9jlFT4tltNGSd7bPsxrp7LN
 EdTfayyPUVZeCoysTNA0WZgz47f+z47vAdIlDHzWCIOZcM1RnJXKA5kFXRf8Fnfa
 +hyvaHSDqYGdRgZxYMcXLL+/cS4foQ/8iQxZBCMomABM0MNUuoJ5tYR6GVetlRcR
 46ZC/5OAvNoKY2Kj4Ky4ROF7aMR3NkYCY6wHUVRcw8778bmuReeLJJPsWojAI+4F
 pWT08+7OUJYb3hRnGxxzKot6CPztdkpQXfXMy+wyNlNbRZ/ivs9/f/GhdblXy/6T
 j12LKIh1IOxpB/wi7GRfeABUuC4MU8xqx6FuPDrBgCTMfVig/wcwF27AUr//a0F5
 xrrzCrDFNAvuyD1WyYilaxWDHAe2o9ROT0JZ4VB3zu40w2VlTT77aqA174xfQa6b
 418Eykw3O11dmsY8AQPTt1HhkDCiewEe4K58CJcmCNEf/inFbvI=
 =kNQc
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCk3VMACgkQ7G51OISz
 Hs0Y+A/9HqeZOFBPE3BlnEKkdxEPp+zoOj/57pWlS+Xr8ySalZXonAAC975voKjS
 6IFejLkN83q5yCrwYBJ/PCc5Xm9deBf+6AE3KyaNV7CD5ycSBy8o9Fx7IH81wFXt
 quiVNlpi7iGiWsm3ubsvDHNgHwQlvN6mbNp+RIqD4YllheUsTXeR05rrfr+TauXv
 Bv7LEolcyUJo0LTzqPsTlVFG9ec+0qWZb1A7LgOZJFWrS/XaYpoXA49KaNbIZIPj
 jb+sWzTsXhOMc2yB1A6P8T3jsx7NxK4Y2rRyCk0l5Hm//Jl7Kcx+vbe0yQPf2DOs
 c0X+0s8Vwk/Ry606XLcy0nIIGePyj++u43r6noB/cN/LnZUbJY3zbywUNvrY1thS
 YG2MtSiV9TzzKP4xApPF7G/4G6H4HOaC1cQbxc3+7sklJf7Coh3pKv775POIfnLS
 UrQ2ToQRNWQC4/l/O9h9BYiIPWSAAUd0uBgZzcdQWeWbdPXMdb02cwZ5YuPoZgSV
 aLsDN8naaQlsZ20wdQ0DaqKoIryHnD0XHNSMkEvZ0bjPxOOtICekixLJQxjwFBQL
 cx3stcQs5PT8l3PQqc2xnVNTqnL7yWGdMYdfNE+4qTLSiw2s5uXbb7BJwttj9Pit
 QKVSVUtp12ObX1kJzqhKa9GQEUPYREVBpN7NwJL2E5hVObfT6OE=
 =ruSa
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.120' into 5.4-2.3.x-imx

This is the 5.4.120 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-05-19 09:41:35 +00:00
Peter Xu f77aa56ad9 mm/hugetlb: fix F_SEAL_FUTURE_WRITE
commit 22247efd822e6d263f3c8bd327f3f769aea9b1d9 upstream.

Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.

Hugh reported an issue with F_SEAL_FUTURE_WRITE not being applied
correctly to hugetlbfs, which I can easily verify using the memfd_test
program; it seems the test is hardly ever run with hugetlbfs pages (by
default it uses shmem).

Meanwhile I found another, probably even more severe, issue: hugetlb fork
won't wr-protect child cow pages, so the child can potentially write to
the parent's private pages.  Patch 2 addresses that.

After this series applied, "memfd_test hugetlbfs" should start to pass.

This patch (of 2):

F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.

$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)

I think it's probably because no one is really running the hugetlbfs test.

Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap(), as we
do in shmem_mmap().  Generalize a helper for that.
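
The generalized helper looks roughly like this (a sketch, shared by
shmem_mmap() and hugetlbfs_file_mmap()):

  static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
  {
          if (seals & F_SEAL_FUTURE_WRITE) {
                  /* New shared writable mappings are refused outright. */
                  if ((vma->vm_flags & VM_SHARED) &&
                      (vma->vm_flags & VM_WRITE))
                          return -EPERM;

                  /* Read-only shared mappings must not be upgradable
                   * via mprotect(); private mappings stay COW-writable. */
                  if (vma->vm_flags & VM_SHARED)
                          vma->vm_flags &= ~(VM_MAYWRITE);
          }
          return 0;
  }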

Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
Fixes: ab3948f58f ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19 10:08:29 +02:00
Axel Rasmussen b3f1731c6d userfaultfd: release page in error path to avoid BUG_ON
commit 7ed9d238c7dbb1fdb63ad96a6184985151b0171c upstream.

Consider the following sequence of events:

1. Userspace issues a UFFD ioctl, which ends up calling into
   shmem_mfill_atomic_pte(). We successfully account the blocks, we
   shmem_alloc_page(), but then the copy_from_user() fails. We return
   -ENOENT. We don't release the page we allocated.
2. Our caller detects this error code, tries the copy_from_user() after
   dropping the mmap_lock, and retries, calling back into
   shmem_mfill_atomic_pte().
3. Meanwhile, let's say another process filled up the tmpfs being used.
4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
   immediately returns - without releasing the page.

This triggers a BUG_ON in our caller, which asserts that the page
should always be consumed, unless -ENOENT is returned.

To fix this, detect if we have such a "dangling" page when accounting
fails, and if so, release it before returning.
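
In shmem_mfill_atomic_pte(), roughly (a sketch of the error path, not the
exact hunk):

  if (!shmem_inode_acct_block(inode, 1)) {
          /* We may hold a page from an earlier -ENOENT retry and now
           * fail accounting: release it, keeping the caller's
           * "page is always consumed" invariant. */
          if (unlikely(*pagep)) {
                  put_page(*pagep);
                  *pagep = NULL;
          }
          return -ENOMEM;
  }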

Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
Fixes: cb658a453b ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reported-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19 10:08:29 +02:00
Miaohe Lin 6aeba28d12 ksm: fix potential missing rmap_item for stable_node
[ Upstream commit c89a384e2551c692a9fe60d093fd7080f50afc51 ]

When removing an rmap_item from the stable tree, the STABLE_FLAG of the
rmap_item is cleared but its head is retained.  So the following scenario
might happen, for a ksm page with rmap_item1:

cmp_and_merge_page
  stable_node->head = &migrate_nodes;
  remove_rmap_item_from_tree, but head still equal to stable_node;
  try_to_merge_with_ksm_page failed;
  return;

For the same ksm page with rmap_item2, stable node migration succeeds this
time.  The stable_node->head no longer equals &migrate_nodes.  For the ksm
page with rmap_item1 again:

cmp_and_merge_page
 stable_node->head != &migrate_nodes && rmap_item->head == stable_node
 return;

We would miss the rmap_item for stable_node, which might result in a
failed rmap_walk_ksm().  Fix this by setting rmap_item->head to NULL when
the rmap_item is removed from the stable tree.
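
The fix itself is a single assignment in remove_rmap_item_from_tree()
(a sketch):

  if (rmap_item->address & STABLE_FLAG) {
          /* ... detach from the stable node as before ... */
          put_anon_vma(rmap_item->anon_vma);
          rmap_item->head = NULL;    /* no stale stable_node pointer */
          rmap_item->address &= PAGE_MASK;
  }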

Link: https://lkml.kernel.org/r/20210330140228.45635-5-linmiaohe@huawei.com
Fixes: 4146d2d673 ("ksm: make !merge_across_nodes migration safe")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19 10:08:27 +02:00
Miaohe Lin dde73137ce mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
[ Upstream commit 34f5e9b9d1990d286199084efa752530ee3d8297 ]

If the zone device page does not belong to un-addressable device memory,
the variable entry will be uninitialized and ultimately lead to an
indeterminate pte entry.  Fix this unexpected case and warn about it.
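
Sketch of the guarded branch in migrate_vma_insert_page() (illustrative;
not the exact hunk):

  if (is_zone_device_page(page)) {
          if (is_device_private_page(page)) {
                  swp_entry_t swp_entry;

                  swp_entry = make_device_private_entry(page,
                                          vma->vm_flags & VM_WRITE);
                  entry = swp_entry_to_pte(swp_entry);
          } else {
                  /* Only un-addressable device memory is supported
                   * here: never leave entry uninitialized. */
                  pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
                  goto abort;
          }
  } else {
          entry = mk_pte(page, vma->vm_page_prot);
          if (vma->vm_flags & VM_WRITE)
                  entry = pte_mkwrite(pte_mkdirty(entry));
  }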

Link: https://lkml.kernel.org/r/20210325131524.48181-4-linmiaohe@huawei.com
Fixes: df6ad69838 ("mm/device-public-memory: device memory cache coherent with CPU")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19 10:08:27 +02:00