linux-brain

Commit Graph

Author	SHA1	Message	Date
Daniel Borkmann	283d742988	bpf: Inherit expanded/patched seen count from old aux data commit d203b0fd863a2261e5d00b97f3d060c4c2a6db71 upstream Instead of relying on current env->pass_cnt, use the seen count from the old aux data in adjust_insn_aux_data(), and expand it to the new range of patched instructions. This change is valid given we always expand 1:n with n>=1, so what applies to the old/original instruction needs to apply for the replacement as well. Not relying on env->pass_cnt is a prerequisite for a later change where we want to avoid marking an instruction seen when verified under speculative execution path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [OP: declare old_data as bool instead of u32 (struct bpf_insn_aux_data.seen is bool in 5.4)] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-08-08 09:04:08 +02:00
Andrey Zhizhikin	a6acc71480	This is the 5.4.137 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEE64UACgkQONu9yGCS aT43BA/7BbeM1RL4UmHcsqTvk3m3nXyGCw/5v9c3JZflmfmfG1H/bbeeHpRs28jL MCzZxVHakxH2MpQxxzPyy7ZD1uAFe2GFXNPoHtfVTyFRvrIQRKWygFCiqeOKnato gRlzPklzO21b+YaiyV+53vG7q0K+kSz7/J2NY8jWSDNCDLOJjBMt0BsSMdq4VyRb R2dsoHAw7ifDUPrMk41xoWdQrYweXV4ebWnKS88wrFicczz5WTNAWu9YnpePzFFn lQCpgCy1rc/64zvJOyHw8Ou7V3dcWtYpVM0iAH1T4j7St7nyDokcZ1BzIxKSklTd QZPncyLszTN/UGGwFgFw4qizGzsothQDmEdQOWtVZBPbfDqntbZJO+a9jkwdfB7H E251/e1UaeyhzEshiYPCSdJEtT945ZDhJerQQZk1yMxUy1b8HobHL8P+Ce/uGypT 6yux9fKpWZJMFN0Su8G2exJcDXFgwiciGxD9oF7Iuo1++6gIrgfizSDLga8QPbub x6/YcoWU32KZ289AyvhCQPsPSh8MQntNz5XiiTNcsS1+/7kcBVtVStH67O/tbPZz lJc2G0lYeYe2SFQvJlmLruD690isKslEr5d3csieWco6+ey5h7YF6hLMLS1BjBOL /Hq2AJj72qDFOh5Dq+zPo2oJhWm2j9Am6REE4btDhOyjLB6YJN8= =8nQ8 -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEKo7AACgkQ7G51OISz Hs1S1RAAqc59rVcx1ijhBhPY+71ayIUxG5tvjrDgiPQBhQ04Gt0IPPBYsj0dZZAL 8wKZJkMNAVO03bPiBIVArkZzGoGhvIF0BaOq1uYXYAyi6jRewbclIrNEKOLM5DU9 CqQr6gix2twUUdg2G0WGRDyLV/WyM1qPifzVRnYdvAdkkJkQV25V8RuQazPdPQ4e c/Inwhx6JHg/35XsH9VCf5uDibT9+dKvFXCr1gcC9tWV2xuqfxMYm1Z/YixhGG3l f4ZI6SQpXVYwV+Fc+jqZWCPVN9rQFGSCLuMZ84TlM0aiOnxdGruiSlEq/6e4XucH MC7HaI5qDENgosx/KJ1K5Hr+CgwwQDPUC9UR3oNvgKOnLR/V7aEVy5IiZdStKv2I nvXFpfK7251Qxr9A7awr0aiuMRAjisiXRqy7M1S0knmdV9AqMb0pNEB8VjIlZy7Z hkq3JUi5UTiimNUSAZdXAzmM4ay3Auv/aerQZrDg3ii2tL4YtLpaCRohBkNBc2g2 i103CYCIylsUJTEPELK1dzS1ZQjP8Dkavo7X9qh3mfxD4+u/XzR5UedLu3ITcmdw OTvg3zFWAPCiE55Pl252Arnjk/5kbb3KI2RuuMbaqJPmxRXiLlzrOPo+6sODFWnd om50F55sS21WQm0MjgRyNchhnYVELnYyutWwJ9M8LnJ3iY7lyyU= =6zUh -----END PGP SIGNATURE----- Merge tag 'v5.4.137' into 5.4-2.3.x-imx This is the 5.4.137 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-08-04 14:26:53 +00:00
Andrey Zhizhikin	d71473b588	This is the 5.4.136 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEBQAcACgkQONu9yGCS aT4FRBAAgFrHSPHhtwcZ2uqAehzajAp7AbKxf1WejxPg/0YH2bE6nbhuLyDWqH5F mhyDpXVltW7xaFYZAEg9CPr6czwHAul4Bql4DH57KbO+/Q5BrS0VguepP0TPcVI5 H8KztBrJCL5TsrOsvB+EXHtqDkEuhX957Qwa6PkBJs12x2Vq3EmazGGKSZSCGKuy v5gM8wztC3NzzOhVDZ2MPbh8RTrbGUEaRFi6B/XNlcEWMAxyqDJlJInbzimIFL6T eOYZ7z+IdrV0I0Eq0tqUmnhONQZxscs/hX1yv7evZtfG7LbT3v4nJu7c6O4FnLwV 61B5aK4aytX7rTLVU+FRxP7MTmvNit71AY8SMSOx+bNLGBtrFstMv+f950j8npq1 683wCAlDD2hw3zOc6rzbXhdowKtIaFirqDEDiYOy/K5r0liaEtQboOmlBO2WDFYy q5HsoCIpNWH2Os4LlA3PYVChEzO5yQJksUgRgUhcNMA0y+8hE1/C91HxNy8HPyHf tIeRHIpdvHETzSbNIYe9b9iQK0f3S2YLI+sdMtrlEXYFpvlD/w2DsVlzr/IRKP1x N1LVskeB7PVzJEImZPTGVrbPu/a/FHtFpx3dgiST72t18rHgCFdxW7pCI05jegLr C72SSES2v3QIIRoPAO6NF/E8ltmT6lnor1AcNeGz5I4rvPB01u8= =pPb8 -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEKo6UACgkQ7G51OISz Hs2X4Q/8CD6BDb4N5T1HeiFyZROk2IZK/PU80vpSTUBwn6x/27xKP0nisApawgqP 5T1Dnucgp42H9cMRnjHikiDa08i0tAoyzwIHJG+1/DG2iHpGfRo9Iu+BZyIWaK2E c+VOr1AHQw2iC9QSmHF4sbNFwBdJNLNXnR3od2AUH/G0gRGUEjchN8rgkQJh3pmO 6RHddTdrXLP31EOc0LEH2pNK8E49e6Ipo/5OY4lc3b0BxF9lhCGLbL8e+E5gTRAF eIpwhKAnqiRHBPjzCMyZPzAfdUQwywk4gnPxy2NTq8O7vSY9NVxmHMAu29/duIK0 UQbW9q1Vv+BIyXUimawh/cxoouRpV3Owue2p21nCIRVS2v0Wo4c98PTvxBgEy+UR MBMVeb1I9XwhPS9SLABADfn7mz9BAWLb+YVkbQHFgMZ3kHT3bT1qe8EbT+VXyeBd 2pviLgXKCsVwJQHxHv2GAJcyLoDhMynFRdaIxa/7CoPadAH3Jj/t5K8frWP5+Cbj iLVJW65S9zQRvHqkd6sOU17l/zHOF0AB9WgVS3PhO2nIxC74ZlSJr/ATe9yB69LU JlDDxZRHF2QsI/07IFw5t91ex+wgUNGMBnTan+hmZh4xoC8Syl+nGfbKZqn9Qhhk bv7Sk2FZTmPlsvFRASEnHINyYg8nt4vaHAz5HBOBjJhyJXKK4cE= =YNbT -----END PGP SIGNATURE----- Merge tag 'v5.4.136' into 5.4-2.3.x-imx This is the 5.4.136 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-08-04 14:26:43 +00:00
Andrey Zhizhikin	90c98361bb	This is the 5.4.135 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD9WokACgkQONu9yGCS aT5RTg/+KOmvPPq4DTSRwQqC7Zk1TzPUQ38H2iZxgpISds7Y0S3RKFmJvXcRoxe2 z0y6b1XErmVvamAlULFEYMxkmpwAiUeO137UqJN/kwyybvEejrAKDiv9kOMcEwh9 zKPfrDQ9UQVbInSMsjQrzaME1voYzdUfhd10vGCxFjQl4RFRy06Fj0SfRmsZeeB+ geu5F6xnba5+IW07okT4FTAsMYPqc+PyP/sENiXQPHt43uSNMQTRdLCh0+7slJ0b Lr9S/euozG8L3wYrs7AUFPaMLDvaQoh4k2mp5oXk8MYYrmKWrLo3e7ZNxBptxjd8 NmwfG9WWfCp4LpN8fMnhrUQxkIj+paDTg9ir1bKmpJwm81miXlWazTQHCw1Mige1 u03P9Q0tUQP3khpVSEE583RLjr8NKR/zkXx97KTL54GsFmwSe4QdbXX3ZlVYj4md FN/8MBBqITNOwm4akObRN4ppOCSD+Qp5a94JOXqmmZ36u+wicAB7SZgVZq6PAmXv kQEYxkS0EALLyzMuK5DBB5zcEq6oT/9Gtr107An1gFGj1hqd1NeV0xPguSxUJLE8 GEL2M9s5jyjbqFZHiz3hPDMB5SKY0T6y8sGtKNmAM6woaLxoRp++JcR/U8m3PpD/ wJ432zHfi6ERp9WsAhyiYpijMj+xU3gCeo8JIP5vsQnaFtvqev8= =qauz -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEKo44ACgkQ7G51OISz Hs0izw//RZp1Hn1xQTToi8PHof/qNviZESLMuhjtxXftG4bX1PZqvKDtBTYudo6+ hsXyjHma/IFyRcNmzqookE1Fli5mrEm0FIdkyxfOTDur7JdRiTfDle7K2Gej0Maq DKuUO2qlXIxKZwe9YmPNKg+ZzFdlMmhdz6rCbAlumt859zErGK/1YLTqDZL4aiGS RZ43eY2BisU23JHbfIyVdvT4xdgL7vB4uadC7WIoM1WXTH/sv6VPd3rIC7oeAGBR q5/D1yfWvV6uyyX60WJnRH2vEUwv35UdNQkrIiFQ7SzonhbJbkE+ZL481g2IfZ3S OwdA2GMn/LE8+Q+IHtoISnUiyw8n7Lae69COHxUIcmggjIGSw5S5Bqoc4OuVw2dv BHICUux3IYwhHNv5Py3CNKiVLg9tKAvFoScrwofV5ToD/pgEBBjtbB0+OIoXtdMp yQVo/CKuiwwIDTrU1FpVC4rt90gS7EErpjOr/QG8paXMHiMxyhAPnBGLr9SPaueD LTXI3ZWNz+ZOFBLH34LZOMdyuWGNbQjwvi86Z5DuCaFL4ZXGAWVl5OUVF7oUyGkL vtgXfh6nzrsVoTBC7tfsuXuFossrTSvlpPtj2t2SB9hQEohE0pL6mS41inuJa4gP b6b0XRtazzskKT4ApEOoaNqlu0ZnDxC/xTdZN9nC5IZ/Mp+BrPQ= =603y -----END PGP SIGNATURE----- Merge tag 'v5.4.135' into 5.4-2.3.x-imx This is the 5.4.135 stable release Conflicts (manual resolve): - drivers/usb/cdns3/gadget.c: Use NXP version, as upstream commit `f53729b828` ("usb: cdns3: Enable TDL_CHK only for OUT ep") is already applied. - arch/arm64/boot/dts/freescale/imx8mq.dtsi: Merge upstream commit `556cf02830` ("arm64: dts: imx8mq: assign PCIe clocks") manually into NXP tree. Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-08-04 14:25:42 +00:00
Paul Gortmaker	eef99860c6	cgroup1: fix leaked context root causing sporadic NULL deref in LTP commit 1e7107c5ef44431bc1ebbd4c353f1d7c22e5f2ec upstream. Richard reported sporadic (roughly one in 10 or so) null dereferences and other strange behaviour for a set of automated LTP tests. Things like: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014 RIP: 0010:kernfs_sop_show_path+0x1b/0x60 ...or these others: RIP: 0010:do_mkdirat+0x6a/0xf0 RIP: 0010:d_alloc_parallel+0x98/0x510 RIP: 0010:do_readlinkat+0x86/0x120 There were other less common instances of some kind of a general scribble but the common theme was mount and cgroup and a dubious dentry triggering the NULL dereference. I was only able to reproduce it under qemu by replicating Richard's setup as closely as possible - I never did get it to happen on bare metal, even while keeping everything else the same. In commit `71d883c37e` ("cgroup_do_mount(): massage calling conventions") we see this as a part of the overall change: -------------- struct cgroup_subsys ss; - struct dentry dentry; [...] - dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root, - CGROUP_SUPER_MAGIC, ns); [...] - if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) { - struct super_block sb = dentry->d_sb; - dput(dentry); + ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + struct super_block sb = fc->root->d_sb; + dput(fc->root); deactivate_locked_super(sb); msleep(10); return restart_syscall(); } -------------- In changing from the local "*dentry" variable to using fc->root, we now export/leave that dentry pointer in the file context after doing the dput() in the unlikely "is_dying" case. With LTP doing a crazy amount of back to back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely becomes slightly likely and then bad things happen. A fix would be to not leave the stale reference in fc->root as follows: -------------- dput(fc->root); + fc->root = NULL; deactivate_locked_super(sb); -------------- ...but then we are just open-coding a duplicate of fc_drop_locked() so we simply use that instead. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org # v5.1+ Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org> Fixes: `71d883c37e` ("cgroup_do_mount(): massage calling conventions") Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-31 08:19:37 +02:00
Yang Yingliang	7f0365b4da	workqueue: fix UAF in pwq_unbound_release_workfn() commit b42b0bddcbc87b4c66f6497f66fc72d52b712aa7 upstream. I got a UAF report when doing fuzz test: [ 152.880091][ T8030] ================================================================== [ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190 [ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030 [ 152.883578][ T8030] [ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249 [ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn [ 152.887358][ T8030] Call Trace: [ 152.887837][ T8030] dump_stack_lvl+0x75/0x9b [ 152.888525][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.889371][ T8030] print_address_description.constprop.10+0x48/0x70 [ 152.890326][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891163][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891999][ T8030] kasan_report.cold.15+0x82/0xdb [ 152.892740][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.893594][ T8030] __asan_load4+0x69/0x90 [ 152.894243][ T8030] pwq_unbound_release_workfn+0x50/0x190 [ 152.895057][ T8030] process_one_work+0x47b/0x890 [ 152.895778][ T8030] worker_thread+0x5c/0x790 [ 152.896439][ T8030] ? process_one_work+0x890/0x890 [ 152.897163][ T8030] kthread+0x223/0x250 [ 152.897747][ T8030] ? set_kthread_struct+0xb0/0xb0 [ 152.898471][ T8030] ret_from_fork+0x1f/0x30 [ 152.899114][ T8030] [ 152.899446][ T8030] Allocated by task 8884: [ 152.900084][ T8030] kasan_save_stack+0x21/0x50 [ 152.900769][ T8030] __kasan_kmalloc+0x88/0xb0 [ 152.901416][ T8030] __kmalloc+0x29c/0x460 [ 152.902014][ T8030] alloc_workqueue+0x111/0x8e0 [ 152.902690][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.903459][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.904198][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.904929][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.905599][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.906247][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.906916][ T8030] do_syscall_64+0x34/0xb0 [ 152.907535][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.908365][ T8030] [ 152.908688][ T8030] Freed by task 8884: [ 152.909243][ T8030] kasan_save_stack+0x21/0x50 [ 152.909893][ T8030] kasan_set_track+0x20/0x30 [ 152.910541][ T8030] kasan_set_free_info+0x24/0x40 [ 152.911265][ T8030] __kasan_slab_free+0xf7/0x140 [ 152.911964][ T8030] kfree+0x9e/0x3d0 [ 152.912501][ T8030] alloc_workqueue+0x7d7/0x8e0 [ 152.913182][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.913949][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.914703][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.915402][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.916077][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.916729][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.917414][ T8030] do_syscall_64+0x34/0xb0 [ 152.918034][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.918872][ T8030] [ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00 [ 152.919203][ T8030] which belongs to the cache kmalloc-512 of size 512 [ 152.921155][ T8030] The buggy address is located 256 bytes inside of [ 152.921155][ T8030] 512-byte region [ffff88810d31bc00, ffff88810d31be00) [ 152.922993][ T8030] The buggy address belongs to the page: [ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318 [ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0 [ 152.926399][ T8030] flags: 0x57ff00000010200(slab\|head\|node=1\|zone=2\|lastcpupid=0x7ff) [ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80 [ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 [ 152.929890][ T8030] page dumped because: kasan: bad access detected [ 152.930759][ T8030] [ 152.931076][ T8030] Memory state around the buggy address: [ 152.931851][ T8030] ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.932967][ T8030] ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.935189][ T8030] ^ [ 152.935763][ T8030] ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.936847][ T8030] ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 152.937940][ T8030] ================================================================== If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq() which invoke a work queue to call pwq_unbound_release_workfn() and use the 'wq'. The 'wq' allocated in alloc_workqueue() will be freed in error path when apply_wqattrs_prepare() fails. So it will lead a UAF. CPU0 CPU1 alloc_workqueue() alloc_and_link_pwqs() apply_wqattrs_prepare() fails apply_wqattrs_cleanup() schedule_work(&pwq->unbound_release_work) kfree(wq) worker_thread() pwq_unbound_release_workfn() <- trigger uaf here If apply_wqattrs_prepare() fails, the new pwq are not linked, it doesn't hold any reference to the 'wq', 'wq' is invalid to access in the worker, so add check pwq if linked to fix this. Fixes: `2d5f0764b5` ("workqueue: split apply_workqueue_attrs() into 3 stages") Cc: stable@vger.kernel.org # v4.2+ Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-31 08:19:37 +02:00
Haoran Luo	f899f24d34	tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. commit 67f0d6d9883c13174669f88adac4f0ee656cc16a upstream. The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to the same buffer page, whose "buffer_data_page" is empty and "read" field is non-zero. An error scenario could be constructed as followed (kernel perspective): 1. All pages in the buffer has been accessed by reader(s) so that all of them will have non-zero "read" field. 2. Read and clear all buffer pages so that "rb_num_of_entries()" will return 0 rendering there's no more data to read. It is also required that the "read_page", "commit_page" and "tail_page" points to the same page, while "head_page" is the next page of them. 3. Invoke "ring_buffer_lock_reserve()" with large enough "length" so that it shot pass the end of current tail buffer page. Now the "head_page", "commit_page" and "tail_page" points to the same page. 4. Discard current event with "ring_buffer_discard_commit()", so that "head_page", "commit_page" and "tail_page" points to a page whose buffer data page is now empty. When the error scenario has been constructed, "tracing_read_pipe" will be trapped inside a deadloop: "trace_empty()" returns 0 since "rb_per_cpu_empty()" returns 0 when it hits the CPU containing such constructed ring buffer. Then "trace_find_next_entry_inc()" always return NULL since "rb_num_of_entries()" reports there's no more entry to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking "tracing_read_pipe" back to the start of the "waitagain" loop. I've also written a proof-of-concept script to construct the scenario and trigger the bug automatically, you can use it to trace and validate my reasoning above: https://github.com/aegistudio/RingBufferDetonator.git Tests has been carried out on linux kernel 5.14-rc2 (2734d6c1b1a089fb593ef6a23d4b70903526fe0c), my fixed version of kernel (for testing whether my update fixes the bug) and some older kernels (for range of affected kernels). Test result is also attached to the proof-of-concept repository. Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/ Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio Cc: stable@vger.kernel.org Fixes: `bf41a158ca` ("ring-buffer: make reentrant") Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Haoran Luo <www@aegistudio.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-28 13:31:00 +02:00
Steven Rostedt (VMware)	59a9f75fb2	tracing/histogram: Rename "cpu" to "common_cpu" commit 1e3bac71c5053c99d438771fc9fa5082ae5d90aa upstream. Currently the histogram logic allows the user to write "cpu" in as an event field, and it will record the CPU that the event happened on. The problem with this is that there's a lot of events that have "cpu" as a real field, and using "cpu" as the CPU it ran on, makes it impossible to run histograms on the "cpu" field of events. For example, if I want to have a histogram on the count of the workqueue_queue_work event on its cpu field, running: ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger Gives a misleading and wrong result. Change the command to "common_cpu" as no event should have "common_*" fields as that's a reserved name for fields used by all events. And this makes sense here as common_cpu would be a field used by all events. Now we can even do: ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger ># cat events/workqueue/workqueue_queue_work/hist # event histogram # # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active] # { common_cpu: 0, cpu: 2 } hitcount: 1 { common_cpu: 0, cpu: 4 } hitcount: 1 { common_cpu: 7, cpu: 7 } hitcount: 1 { common_cpu: 0, cpu: 7 } hitcount: 1 { common_cpu: 0, cpu: 1 } hitcount: 1 { common_cpu: 0, cpu: 6 } hitcount: 2 { common_cpu: 0, cpu: 5 } hitcount: 2 { common_cpu: 1, cpu: 1 } hitcount: 4 { common_cpu: 6, cpu: 6 } hitcount: 4 { common_cpu: 5, cpu: 5 } hitcount: 14 { common_cpu: 4, cpu: 4 } hitcount: 26 { common_cpu: 0, cpu: 0 } hitcount: 39 { common_cpu: 2, cpu: 2 } hitcount: 184 Now for backward compatibility, I added a trick. If "cpu" is used, and the field is not found, it will fall back to "common_cpu" and work as it did before. This way, it will still work for old programs that use "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it will get that event's "cpu" field, which is probably what it wants anyway. I updated the tracefs/README to include documentation about both the common_timestamp and the common_cpu. This way, if that text is present in the README, then an application can know that common_cpu is supported over just plain "cpu". Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: `8b7622bf94` ("tracing: Add cpu field for hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-28 13:31:00 +02:00
Odin Ugedal	2a47e0719a	sched/fair: Fix CFS bandwidth hrtimer expiry type [ Upstream commit 72d0ad7cb5bad265adb2014dbe46c4ccb11afaba ] The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making the runtime_refresh_within return false instead of true. These situations are rare, but they do happen. This does not cause user-facing issues or errors; other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations. Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-25 14:35:13 +02:00
Andrey Zhizhikin	af4a69fcb7	This is the 5.4.134 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD22Z0ACgkQONu9yGCS aT7fWQ//TgR/TwyqFf3jFQ75gA4/9keirtaZaFd42CUWJQQeyUTc3YcoFV6yQ/fi letGckFWKxAK/ftqcla9yJemLmXzt7Hx4wpsF+5r97YLN7Yv5M9CznqdATgrg+pY dfRnJJZhAs1gnDVzSiBfHlgxCCHw/7UEOs7o+slBSU/KKgGaF+H542i2v+n2XIWV kS3J/K2fZN8GF6o0L6Wzpvd5/8xjqs4Nm883ZTQ8B/dR0Cxz9sLW0EAjK4THcmFM 7riUT4GewvsohddsaGyQ2akieHiZLbGyxtola72wSpoNwiy/tYLpBpKlJNFwyeIr DTBgreoB9NGikQfEZrHjzvUvK6KHXkz0h3pprjXOKFUnu7uVm9OFUw5scAq/Wh0Q qe9nB4xCiZqxJJTEVCa499ZiBDvS6J0kbs5fge3reIEfno5I4LdlJJTkpWgV9Flh dIninifXmGYtaLrTkWJzAydSntt0Z+Zk/y9aS08jwHVpgs1HXVXr0xEqlxr1/5su K5fiSGXXhYSbFuAvW8tYIJickXvM8xBVpOkkiqLuhxts2kiZp8TJn2H/DiQBekJt CQeUvNQiEbwI/NFPiJQrtzQHC4Gq2Ibv9+MpnQvOW/RxHH7NSJ4vOp8FgaHBxIZE Jwa3NPHG8bzCo/n+vRn00Re1+OGYzsphxO2hRp+EjhNxKzpxwkc= =3pL6 -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmD3ITMACgkQ7G51OISz Hs2u9w/+LaT+Iko2MBP8bc0yqq9HYxObAHLWdC0oAYnygO+7MQSldaDlAtLDvfr8 d6IdCjPfFPHLWFb/Ggjx7dH6PhYwj+sarlljeXSFriEro2994ig0C75J83hOkks6 d9OeONg2PHivigrXLt+00Zdd4Y9Ubbo3qStBeWaaezCfZtMulcgrCBIqQ0CPi3l4 iocX7sgmGY51fypoh75+7SpfZOTDKGigzTWcTN/chMVGu4Jy+hUpnNMYWSw2sOvD VdY460C6mMmH/UPiccwM01Mo7z75Psudn1K6Ic/c+FzOH1hCydvNIebfBuo3nJP7 DEIAR4wz6AXoOgcU1hnlsvoBR9x5CNR871sAA9HwJj36XwLaJ5RfOIqESS+odwbl Xr8MuE/F3wu8gtz/4LqSjuFDWruMu/YH9gEqH/lbMl9g54MkZMAIQEUeGpMN6Los sZMJufotIc9VPrTnVwVNvH5r40I9FZRcBx0ctG0ht2JmfIY9cV6+P6ENnKzMV+NU lIt5rMfv38pul2H7ZpFXizqLIU74p2NyfVspJOneD9j5UmN42GkLs77fREXi5Eox K3jCi7z13eBZNG8v1O+I7p7Ei7RZkx+1tTviJh2RipxK27Ujl/Z3sYYV8zcPvzBo sJfyGXfJevJAH+adHm+L8VbHT4EHzrbSaUb8lqq74ls9PKr5y+g= =75SM -----END PGP SIGNATURE----- Merge tag 'v5.4.134' into 5.4-2.3.x-imx This is the 5.4.134 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-07-20 19:17:04 +00:00
Andrey Zhizhikin	0615afea9f	This is the 5.4.133 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmD1IXQACgkQONu9yGCS aT50sBAAtueT2WsCuD1psEN219cK0e7AuLrHXtAnkHFJIPWnzS0vyBy7/hmFwWtq sBSenqG8qufqtLVnMkeEAhu2/sk/5NHRPQJtK0k4hCzt8FQuiVQco1raOtetIJx+ +wBfE4FAGDmiYGkcuzh61n1euvpBetXd9HFfWtSfQq7Q3nN+sfv0q1V9ZK0MUJ8v ipvSY1hTSDEQQJ7cU48DDJtZUGNxrfEFzi4CLI2YVphzoHEbowd5nxtHUL5cwDhx 3sHYJoN+5RAkRinzGyviDlRpodNUUkLusBzs54xNIzgdzkckEKniKotZ2lUGsEu+ QQgj0paNB95GLkY/Rgn6AL03AQdYBgGIjHQkSaYJ+UM9TlacqgFMiGugn28bj0o/ 1F4s6zCWG5tuhM5zNcnTsJmwSPA3eZ0uI6NCkjKC/RAyD5SC6JQqcf5zYCzygdT/ PpeFRcZGoxyQqmOjW2e+tpNAbHuIeayExx/6/3rw3b/xaR9Ju9mYxNDiIpYZwdc6 FIWOsHG+bEEZANiWv6Ju7DfOTKg8F7mbm4Zrd00euIWEsxuUZO/lAzxPR8pPzsn7 2k46PDrhah25Y/tbSE5hdKrLqSorSjIg+7CxLAk7LWPmq13zzEd8y+e/Bk5rFJ4T 7vPLUb23OYFdrVMOXd1UyKhcP4CKyOf7IvG4SsZwj9WfWoNEDNg= =2WCO -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmD25jEACgkQ7G51OISz Hs1QgxAAl7FiNoI3XJ2GbS0bQAxYXeUtcQ6lJmHRD/6IEikDr12bQ9XMvF1k4o+P 15MzTOndSRSW0Yxi5QXRTGcBuqbC+HhnUVHggJpUJugUt9Q5TtO6ZxVX0dkbMhp3 AXK55Zm465y4dX6ys59IWL138xMKd0pBIfchlb3oSaiQ9qBFmKDXMEYDGOxDC3e9 VR1EO5PfpzEgrGONO+Xxu+2IDWRiWfKGyIaZCIWsRqlsrjdOFMKbTL2iBwMgrqmi D3kndN6kGvxqoHCe3P9chKqNfU+P3durBNomhhXyBZBRNT2XW19UVpk3VIhqa5Dw 7DA6zHihFuZlI9XEZKAr4cokxS9IRFWZBayYE4diMu4+BA45mKIS+1BRyPDozgRG cyp6QaGI8IEzdI1oa6WW/CR/zkhQKyIj/lhwlx98XJlkGoDtfSgGMx6QdmtH8Pfk Gmgg3aHV/AQMRasSfKPDLGWD0f3nVzneHh9ceK9/j8gjY+T6msVYw/p4kIhXZQCE cZplDehOsFtJubB5lXxe1PZzHedM5p0mrYousngjVhHjbe/5h243fj3gBsTJK3zZ XP74VPZyqwBtEUrMEL2nPlsQSDfeZRnNSCiXNq1vJg+skqJTMicv2TUP0134ME8K yUjmJxd46diY/bOPBL737MuWMlpm9Bpg31qRe92jSHovKrbZVTk= =rRdV -----END PGP SIGNATURE----- Merge tag 'v5.4.133' into 5.4-2.3.x-imx This is the 5.4.133 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-07-20 15:05:19 +00:00
Andrey Zhizhikin	e9646ca701	This is the 5.4.132 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDu+p0ACgkQONu9yGCS aT5SOw/9F58e4gz7PSTn4A9oCTNodRPe9B9rzf3y1Ol0k7T1aeQoWsPFOkZpNSOJ tdOGEXnwYnLpMC7nuFshWv1uKGAL/weHADyGV6J37AntYFjpEFhJhSH7pGGhDk7V EeIl98luBynPXOKNnDvcrQweeRaHKOInQBT8JJzwwsZbF2oqfOqdU0A787BiRu+3 zoi/mV0upDB443ji/JY0xj+o4jlbsuD0WxEqgkcD2YHL+QvU5Wr0mGys7m5gG9x7 TpKpMic0ILrF1vt/znLL5rOlX497prTvZ74ZXV/DYizeYxqtl/UG3CZjo1uf2yqk pAXA57paz6DY2Ct+3QbJBeuer27bTz6SCClSS1om9AcUk6oNSdULmMdTGvQb0SLU wx1Cy8b2ei04SVl96+McKKZ6ln47LJediGn0qIdwC6O/XHHrLq4u5PkSnQxRU4pA GH1tP5oYy4GzL9RbBeiDJQETFiXwkexSEWVyuSc6BhqQXao9yVzmLQbL1zgjH/zO m/tckZ3vEg+ll8j4QJCisHRyqYhwfru4PsJQH9Q7q6CtIuGOsd0Z/OUcLuF6knXg jDOrDIykE/PnkQ2Dc2RhdONP1ud5j3oBnHvNHs6FDghRKjaixMQzg3g/RNtnAaTj +7Xsfbi6ntpZSDOaY7YNgt+ZH3l4YRnUL/xBA6qIygayz374nzI= =LU0G -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmD25ikACgkQ7G51OISz Hs1dcw/8DKef1hGC5O2WKfpInTYtgnClkyD5/yOnGAPMvMRDGybA3dRejpIEefNM Qol1XICjb0wdBDV0I+n+fnGbBgaX3g0N/pn16pdbbSPBBe1L+d97gZNZznDGHYZu 033qtbxii8e0QTxTvO7nx4L80ZZsyPLchpPxowS/vd1Ezti+pTIU4y43MCc2jYLL KqUBDz72TkPLhgVZdDJ1z9gb+OoJ+sJPaeBrO57hpY/os9SxlMPeY56YrD3Hyfy5 IHZw3bTCDiIXpHBaJG8fvuudaM5M8V3dbD6oXnEPo1Gzb1Y7WR4Z7q28g7arSYjP fMPd243mCXd1V7LpmapxXvFsnbdsA7oauTho50dwmEvxQf9jEgX6thBWAFsrItaS crHdOppS7Lc3FK8cTMxZd6ZyZpaU6sF183tMOteuhtwmF/uoy1LBHqLnAvtfWYrl InGcImgABRkiYBRyODlgC4UNLd49Svon/8HcbBZlmeGIkosXjo5r1itnipgnF/TB /NkHRkixYTBCnJZyx+9Lihqw+HMnHVfjOnIBjbXjzX9ITH/tiMn4y87E+x9vRQqr Td5AKJwiSXSWZBQoX+XNLqXRwjZKHVQe45J4gzL9dhCzi9bwK99BvBPWr8+JyI7w 83YQfkhPju47+KFrEN6DUBxdYrROsJLsjgdTl38IlCi4SKoQSkE= =Li9I -----END PGP SIGNATURE----- Merge tag 'v5.4.132' into 5.4-2.3.x-imx This is the 5.4.132 stable release Conflicts (manual resolve): - drivers/gpu/drm/rockchip/cdn-dp-core.c: Fix merge hiccup when integrating upstream commit `450c25b8a4` ("drm/rockchip: cdn-dp-core: add missing clk_disable_unprepare() on error in cdn_dp_grf_write()") - drivers/perf/fsl_imx8_ddr_perf.c: Port upstream commit `3fea9b708a` ("drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()") manually to NXP version. Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-07-20 15:04:13 +00:00
Frederic Weisbecker	fd005f53cb	srcu: Fix broken node geometry after early ssp init [ Upstream commit b5befe842e6612cf894cf4a199924ee872d8b7d8 ] An srcu_struct structure that is initialized before rcu_init_geometry() will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once rcu_init_geometry() is called, this hierarchy is compressed as needed for the actual maximum number of CPUs for this system. Later on, that srcu_struct structure is confused, sometimes referring to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead to the new num_possible_cpus() hierarchy. For example, each of its ->mynode fields continues to reference the original leaf rcu_node structures, some of which might no longer exist. On the other hand, srcu_for_each_node_breadth_first() traverses to the new node hierarchy. There are at least two bad possible outcomes to this: 1) a) A callback enqueued early on an srcu_data structure (call it *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf (say 3 levels). b) The grace period ends after rcu_init_geometry() shrinks the nodes level to a single one. srcu_gp_end() walks through the new srcu_node hierarchy without ever reaching the old leaves so the callback is never executed. This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests verification never completes and the boot hangs: [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds. [ 5413.147564] Not tainted 5.12.0-rc4+ #28 [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000 [ 5413.168099] Call Trace: [ 5413.170555] __schedule+0x36c/0x930 [ 5413.174057] ? wait_for_completion+0x88/0x110 [ 5413.178423] schedule+0x46/0xf0 [ 5413.181575] schedule_timeout+0x284/0x380 [ 5413.185591] ? wait_for_completion+0x88/0x110 [ 5413.189957] ? mark_held_locks+0x61/0x80 [ 5413.193882] ? mark_held_locks+0x61/0x80 [ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50 [ 5413.202173] ? wait_for_completion+0x88/0x110 [ 5413.206535] wait_for_completion+0xb4/0x110 [ 5413.210724] ? srcu_torture_stats_print+0x110/0x110 [ 5413.215610] srcu_barrier+0x187/0x200 [ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50 [ 5413.224244] ? rdinit_setup+0x2b/0x2b [ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40 [ 5413.232700] do_one_initcall+0x63/0x310 [ 5413.236541] ? rdinit_setup+0x2b/0x2b [ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80 [ 5413.244912] kernel_init_freeable+0x253/0x28f [ 5413.249273] ? rest_init+0x250/0x250 [ 5413.252846] kernel_init+0xa/0x110 [ 5413.256257] ret_from_fork+0x22/0x30 2) An srcu_struct structure that is initialized before rcu_init_geometry() and used afterward will always have stale rdp->mynode references, resulting in callbacks to be missed in srcu_gp_end(), just like in the previous scenario. This commit therefore causes init_srcu_struct_nodes to initialize the geometry, if needed. This ensures that the srcu_node hierarchy is properly built and distributed from the get-go. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-20 16:10:41 +02:00
Christian Brauner	c17363ccd6	cgroup: verify that source is a string commit 3b0462726e7ef281c35a7a4ae33e93ee2bc9975b upstream. The following sequence can be used to trigger a UAF: int fscontext_fd = fsopen("cgroup"); int fd_null = open("/dev/null, O_RDONLY); int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null); close_range(3, ~0U, 0); The cgroup v1 specific fs parser expects a string for the "source" parameter. However, it is perfectly legitimate to e.g. specify a file descriptor for the "source" parameter. The fs parser doesn't know what a filesystem allows there. So it's a bug to assume that "source" is always of type fs_value_is_string when it can reasonably also be fs_value_is_file. This assumption in the cgroup code causes a UAF because struct fs_parameter uses a union for the actual value. Access to that union is guarded by the param->type member. Since the cgroup paramter parser didn't check param->type but unconditionally moved param->string into fc->source a close on the fscontext_fd would trigger a UAF during put_fs_context() which frees fc->source thereby freeing the file stashed in param->file causing a UAF during a close of the fd_null. Fix this by verifying that param->type is actually a string and report an error if not. In follow up patches I'll add a new generic helper that can be used here and by other filesystems instead of this error-prone copy-pasta fix. But fixing it in here first makes backporting a it to stable a lot easier. Fixes: `8d2451f499` ("cgroup1: switch to option-by-option parsing") Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@kernel.org> Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-20 16:10:40 +02:00
Steven Rostedt (VMware)	d4238c7539	tracing: Do not reference char * as a string in histograms commit 704adfb5a9978462cd861f170201ae2b5e3d3a80 upstream. The histogram logic was allowing events with char * pointers to be used as normal strings. But it was easy to crash the kernel with: # echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger And open some files, and boom! BUG: unable to handle page fault for address: 00007f2ced0c3280 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0 Oops: 0000 [#1] PREEMPT SMP CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:strlen+0x0/0x20 Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0 RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280 RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98 R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074 R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280 FS: 00007f2ced0f8580(0000) GS:ffff93825a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0 Call Trace: event_hist_trigger+0x463/0x5f0 ? find_held_lock+0x32/0x90 ? sched_clock_cpu+0xe/0xd0 ? lock_release+0x155/0x440 ? kernel_init_free_pages+0x6d/0x90 ? preempt_count_sub+0x9b/0xd0 ? kernel_init_free_pages+0x6d/0x90 ? get_page_from_freelist+0x12c4/0x1680 ? __rb_reserve_next+0xe5/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 event_triggers_call+0x52/0xe0 ftrace_syscall_enter+0x264/0x2c0 syscall_trace_enter.constprop.0+0x1ee/0x210 do_syscall_64+0x1c/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Where it triggered a fault on strlen(key) where key was the filename. The reason is that filename is a char * to user space, and the histogram code just blindly dereferenced it, with obvious bad results. I originally tried to use strncpy_from_user/kernel_nofault() but found that there's other places that its dereferenced and not worth the effort. Just do not allow "char *" to act like strings. Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: stable@vger.kernel.org Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Tom Zanussi <zanussi@kernel.org> Fixes: `79e577cbce` ("tracing: Support string type key properly") Fixes: `5967bd5c42` ("tracing: Let filter_assign_type() detect FILTER_PTR_STRING") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-20 16:10:40 +02:00
Paul Burton	8489ebfac3	tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT commit 4030a6e6a6a4a42ff8c18414c9e0c93e24cc70b8 upstream. Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that on systems where pid_max is configured higher than PID_MAX_DEFAULT the ftrace record-tgid option doesn't work so well. Any tasks with PIDs higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and don't show up in the saved_tgids file. In particular since systemd v243 & above configure pid_max to its highest possible 1<<22 value by default on 64 bit systems this renders the record-tgids option of little use. Increase the size of tgid_map to the configured pid_max instead, allowing it to cover the full range of PIDs up to the maximum value of PID_MAX_LIMIT if the system is configured that way. On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect. The size of tgid_map could also increase for CONFIG_BASE_SMALL=y configurations, but these seem unlikely to be systems upon which people are both configuring a large pid_max and running ftrace with record-tgid anyway. Of note is that we only allocate tgid_map once, the first time that the record-tgid option is enabled. Therefore its size is only set once, to the value of pid_max at the time the record-tgid option is first enabled. If a user increases pid_max after that point, the saved_tgids file will not contain entries for any tasks with pids beyond the earlier value of pid_max. Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com Fixes: `d914ba37d7` ("tracing: Add support for recording tgid of tasks") Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Fernandes <joelaf@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paul Burton <paulburton@google.com> [ Fixed comment coding style ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-19 08:53:17 +02:00
Paul Burton	41aa590302	tracing: Simplify & fix saved_tgids logic commit b81b3e959adb107cd5b36c7dc5ba1364bbd31eb2 upstream. The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid. The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map. A minimal fix would be to remove the call to trace_find_tgid, turning: if (trace_find_tgid(ptr)) into: if (ptr) ..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely. Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com Fixes: `d914ba37d7` ("tracing: Add support for recording tgid of tasks") Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Fernandes <joelaf@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paul Burton <paulburton@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-19 08:53:17 +02:00
Jan Kara	4d4f11c356	rq-qos: fix missed wake-ups in rq_qos_throttle try two commit 11c7aa0ddea8611007768d3e6b58d45dc60a19e1 upstream. Commit `545fbd0775` ("rq-qos: fix missed wake-ups in rq_qos_throttle") tried to fix a problem that a process could be sleeping in rq_qos_wait() without anyone to wake it up. However the fix is not complete and the following can still happen: CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker) rq_qos_wait() rq_qos_wait() acquire_inflight_cb() -> fails acquire_inflight_cb() -> fails completes IOs, inflight decreased prepare_to_wait_exclusive() prepare_to_wait_exclusive() has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers has_sleeper = !wq_has_single_sleeper() -> true io_schedule() io_schedule() Deadlock as now there's nobody to wakeup the two waiters. The logic automatically blocking when there are already sleepers is really subtle and the only way to make it work reliably is that we check whether there are some waiters in the queue when adding ourselves there. That way, we are guaranteed that at least the first process to enter the wait queue will recheck the waiting condition before going to sleep and thus guarantee forward progress. Fixes: `545fbd0775` ("rq-qos: fix missed wake-ups in rq_qos_throttle") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-19 08:53:16 +02:00
Thomas Gleixner	7044e6bbc8	cpu/hotplug: Cure the cpusets trainwreck commit b22afcdf04c96ca58327784e280e10288cfd3303 upstream. Alexey and Joshua tried to solve a cpusets related hotplug problem which is user space visible and results in unexpected behaviour for some time after a CPU has been plugged in and the corresponding uevent was delivered. cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a workqueue. This is done because the cpusets code has already a lock nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or waiting for the work to finish with cpu_hotplug_lock held can and will deadlock because that results in the reverse lock order. As a consequence the uevent can be delivered before cpusets have consistent state which means that a user space invocation of sched_setaffinity() to move a task to the plugged CPU fails up to the point where the scheduled work has been processed. The same is true for CPU unplug, but that does not create user observable failure (yet). It's still inconsistent to claim that an operation is finished before it actually is and that's the real issue at hand. uevents just make it reliably observable. Obviously the problem should be fixed in cpusets/cgroups, but untangling that is pretty much impossible because according to the changelog of the commit which introduced this 8 years ago: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and the whole code is built around that. So bite the bullet and invoke the relevant cpuset function, which waits for the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and only when tasks are not frozen by suspend/hibernate because that would obviously wait forever. Waiting there with cpu_add_remove_lock, which is protecting the present and possible CPU maps, held is not a problem at all because neither work queues nor cpusets/cgroups have any lockchains related to that lock. Waiting in the hotplug machinery is not problematic either because there are already state callbacks which wait for hardware queues to drain. It makes the operations slightly slower, but hotplug is slow anyway. This ensures that state is consistent before returning from a hotplug up/down operation. It's still inconsistent during the operation, but that's a different story. Add a large comment which explains why this is done and why this is not a dump ground for the hack of the day to work around half thought out locking schemes. Document also the implications vs. hotplug operations and serialization or the lack of it. Thanks to Alexy and Joshua for analyzing why this temporary sched_setaffinity() failure happened. Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()") Reported-by: Alexey Klimov <aklimov@redhat.com> Reported-by: Joshua Baker <jobaker@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexey Klimov <aklimov@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-19 08:53:15 +02:00
Daniel Borkmann	e217aadc9b	bpf: Fix up register-based shifts in interpreter to silence KUBSAN [ Upstream commit 28131e9d933339a92f78e7ab6429f4aaaa07061c ] syzbot reported a shift-out-of-bounds that KUBSAN observed in the interpreter: [...] UBSAN: shift-out-of-bounds in kernel/bpf/core.c:1420:2 shift exponent 255 is too large for 64-bit type 'long long unsigned int' CPU: 1 PID: 11097 Comm: syz-executor.4 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 ___bpf_prog_run.cold+0x19/0x56c kernel/bpf/core.c:1420 __bpf_prog_run32+0x8f/0xd0 kernel/bpf/core.c:1735 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline] bpf_prog_run_clear_cb include/linux/filter.h:755 [inline] run_filter+0x1a1/0x470 net/packet/af_packet.c:2031 packet_rcv+0x313/0x13e0 net/packet/af_packet.c:2104 dev_queue_xmit_nit+0x7c2/0xa90 net/core/dev.c:2387 xmit_one net/core/dev.c:3588 [inline] dev_hard_start_xmit+0xad/0x920 net/core/dev.c:3609 __dev_queue_xmit+0x2121/0x2e00 net/core/dev.c:4182 __bpf_tx_skb net/core/filter.c:2116 [inline] __bpf_redirect_no_mac net/core/filter.c:2141 [inline] __bpf_redirect+0x548/0xc80 net/core/filter.c:2164 ____bpf_clone_redirect net/core/filter.c:2448 [inline] bpf_clone_redirect+0x2ae/0x420 net/core/filter.c:2420 ___bpf_prog_run+0x34e1/0x77d0 kernel/bpf/core.c:1523 __bpf_prog_run512+0x99/0xe0 kernel/bpf/core.c:1737 bpf_dispatcher_nop_func include/linux/bpf.h:644 [inline] bpf_test_run+0x3ed/0xc50 net/bpf/test_run.c:50 bpf_prog_test_run_skb+0xabc/0x1c50 net/bpf/test_run.c:582 bpf_prog_test_run kernel/bpf/syscall.c:3127 [inline] __do_sys_bpf+0x1ea9/0x4f00 kernel/bpf/syscall.c:4406 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Generally speaking, KUBSAN reports from the kernel should be fixed. However, in case of BPF, this particular report caused concerns since the large shift is not wrong from BPF point of view, just undefined. In the verifier, K-based shifts that are >= {64,32} (depending on the bitwidth of the instruction) are already rejected. The register-based cases were not given their content might not be known at verification time. Ideas such as verifier instruction rewrite with an additional AND instruction for the source register were brought up, but regularly rejected due to the additional runtime overhead they incur. As Edward Cree rightly put it: Shifts by more than insn bitness are legal in the BPF ISA; they are implementation-defined behaviour [of the underlying architecture], rather than UB, and have been made legal for performance reasons. Each of the JIT backends compiles the BPF shift operations to machine instructions which produce implementation-defined results in such a case; the resulting contents of the register may be arbitrary but program behaviour as a whole remains defined. Guard checks in the fast path (i.e. affecting JITted code) will thus not be accepted. The case of division by zero is not truly analogous here, as division instructions on many of the JIT-targeted architectures will raise a machine exception / fault on division by zero, whereas (to the best of my knowledge) none will do so on an out-of-bounds shift. Given the KUBSAN report only affects the BPF interpreter, but not JITs, one solution is to add the ANDs with 63 or 31 into ___bpf_prog_run(). That would make the shifts defined, and thus shuts up KUBSAN, and the compiler would optimize out the AND on any CPU that interprets the shift amounts modulo the width anyway (e.g., confirmed from disassembly that on x86-64 and arm64 the generated interpreter code is the same before and after this fix). The BPF interpreter is slow path, and most likely compiled out anyway as distros select BPF_JIT_ALWAYS_ON to avoid speculative execution of BPF instructions by the interpreter. Given the main argument was to avoid sacrificing performance, the fact that the AND is optimized away from compiler for mainstream archs helps as well as a solution moving forward. Also add a comment on LSH/RSH/ARSH translation for JIT authors to provide guidance when they see the ___bpf_prog_run() interpreter code and use it as a model for a new JIT backend. Reported-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com Reported-by: Kurt Manucredo <fuzzybritches0@gmail.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Co-developed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: syzbot+bed360704c521841c85d@syzkaller.appspotmail.com Cc: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/bpf/0000000000008f912605bd30d5d7@google.com Link: https://lore.kernel.org/bpf/bac16d8d-c174-bdc4-91bd-bfa62b410190@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-19 08:53:11 +02:00
Paul E. McKenney	61f6c18fff	rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread() [ Upstream commit 8e4b1d2bc198e34b48fc7cc3a3c5a2fcb269e271 ] Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(), which works, except that rcu_spawn_gp_kthread() is also invoked via an early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to kthread_prio that are carried out by rcu_spawn_gp_kthread(). There is no guaranttee of ordering among early_initcall() handlers, and thus no guarantee that kthread_prio will be properly checked and range-limited at the time that rcu_spawn_core_kthreads() needs it. In most cases, this bug is harmless. After all, the only reason that rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user specified a nonsensical value for this boot parameter, which experience indicates is rare. Nevertheless, a bug is a bug. This commit therefore causes the rcu_spawn_core_kthreads() function to be invoked directly from rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have been carried out. Fixes: `48d07c04b4` ("rcu: Enable elimination of Tree-RCU softirq processing") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:36 +02:00
Qais Yousef	97f32c7f33	sched/uclamp: Fix uclamp_tg_restrict() [ Upstream commit 0213b7083e81f4acd69db32cb72eb4e5f220329a ] Now cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's inversion between uclamp requested by task (p) and the uclamp values of the taskgroup it's attached to (tg). Following table demonstrates 2 corner cases: \| p \| tg \| effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min \| 60% \| 0% \| 60% -----------+-----+------+----------- uclamp_max \| 80% \| 50% \| 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min \| 0% \| 30% \| 30% -----------+-----+------+----------- uclamp_max \| 20% \| 50% \| 20% -----------+-----+------+----------- With this fix we get: \| p \| tg \| effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min \| 60% \| 0% \| 50% -----------+-----+------+----------- uclamp_max \| 80% \| 50% \| 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min \| 0% \| 30% \| 30% -----------+-----+------+----------- uclamp_max \| 20% \| 50% \| 30% -----------+-----+------+----------- Additionally uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for instance could have an impact on the effective UCLAMP_MIN of the tasks. \| p \| tg \| effective -----------+-----+------+----------- old -----------+-----+------+----------- uclamp_min \| 60% \| 0% \| 50% -----------+-----+------+----------- uclamp_max \| 80% \| 50% \| 50% -----------+-----+------+----------- new -----------+-----+------+----------- uclamp_min \| 60% \| 0% \| 60% -----------+-----+------+----------- uclamp_max \| 80% \|70% \| 70% -----------+-----+------+----------- [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <xuewen.yan94@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:24 +02:00
Vincent Donnefort	a3ddf1fb37	sched/rt: Fix Deadline utilization tracking during policy change [ Upstream commit d7d607096ae6d378b4e92d49946d22739c047d4c ] DL keeps track of the utilization on a per-rq basis with the structure avg_dl. This utilization is updated during task_tick_dl(), put_prev_task_dl() and set_next_task_dl(). However, when the current running task changes its policy, set_next_task_dl() which would usually take care of updating the utilization when the rq starts running DL tasks, will not see a such change, leaving the avg_dl structure outdated. When that very same task will be dequeued later, put_prev_task_dl() will then update the utilization, based on a wrong last_update_time, leading to a huge spike in the DL utilization signal. The signal would eventually recover from this issue after few ms. Even if no DL tasks are run, avg_dl is also updated in __update_blocked_others(). But as the CPU capacity depends partly on the avg_dl, this issue has nonetheless a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to DL. Fixes: `3727e0e` ("sched/dl: Add dl_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:24 +02:00
Vincent Donnefort	3fb53be07f	sched/rt: Fix RT utilization tracking during policy change [ Upstream commit fecfcbc288e9f4923f40fd23ca78a6acdc7fdf6c ] RT keeps track of the utilization on a per-rq basis with the structure avg_rt. This utilization is updated during task_tick_rt(), put_prev_task_rt() and set_next_task_rt(). However, when the current running task changes its policy, set_next_task_rt() which would usually take care of updating the utilization when the rq starts running RT tasks, will not see a such change, leaving the avg_rt structure outdated. When that very same task will be dequeued later, put_prev_task_rt() will then update the utilization, based on a wrong last_update_time, leading to a huge spike in the RT utilization signal. The signal would eventually recover from this issue after few ms. Even if no RT tasks are run, avg_rt is also updated in __update_blocked_others(). But as the CPU capacity depends partly on the avg_rt, this issue has nonetheless a significant impact on the scheduler. Fix this issue by ensuring a load update when a running task changes its policy to RT. Fixes: `371bf427` ("sched/rt: Add rt_rq utilization tracking") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:24 +02:00
Qais Yousef	8e5ffc1039	sched/uclamp: Fix locking around cpu_util_update_eff() [ Upstream commit 93b73858701fd01de26a4a874eb95f9b7156fd4b ] cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the uclamp_mutex or rcu_read_lock() like other call sites, which is a mistake. The uclamp_mutex is required to protect against concurrent reads and writes that could update the cgroup hierarchy. The rcu_read_lock() is required to traverse the cgroup data structures in cpu_util_update_eff(). Surround the caller with the required locks and add some asserts to better document the dependency in cpu_util_update_eff(). Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups") Reported-by: Quentin Perret <qperret@google.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:20 +02:00
Qais Yousef	0b199ce65b	sched/uclamp: Fix wrong implementation of cpu.uclamp.min [ Upstream commit 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445 ] cpu.uclamp.min is a protection as described in cgroup-v2 Resource Distribution Model Documentation/admin-guide/cgroup-v2.rst which means we try our best to preserve the minimum performance point of tasks in this group. See full description of cpu.uclamp.min in the cgroup-v2.rst. But the current implementation makes it a limit, which is not what was intended. For example: tg->cpu.uclamp.min = 20% p0->uclamp[UCLAMP_MIN] = 0 p1->uclamp[UCLAMP_MIN] = 50% Previous Behavior (limit): p0->effective_uclamp = 0 p1->effective_uclamp = 20% New Behavior (Protection): p0->effective_uclamp = 20% p1->effective_uclamp = 50% Which is inline with how protections should work. With this change the cgroup and per-task behaviors are the same, as expected. Additionally, we remove the confusing relationship between cgroup and !user_defined flag. We don't want for example RT tasks that are boosted by default to max to change their boost value when they attach to a cgroup. If a cgroup wants to limit the max performance point of tasks attached to it, then cpu.uclamp.max must be set accordingly. Or if they want to set different boost value based on cgroup, then sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max and set the right cpu.uclamp.min for each group to let the RT tasks obtain the desired boost value when attached to that group. As it stands the dependency on !user_defined flag adds an extra layer of complexity that is not required now cpu.uclamp.min behaves properly as a protection. The propagation model of effective cpu.uclamp.min in child cgroups as implemented by cpu_util_update_eff() is still correct. The parent protection sets an upper limit of what the child cgroups will effectively get. Fixes: `3eac870a32` (sched/uclamp: Use TG's clamps to restrict TASK's clamps) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:20 +02:00
Petr Mladek	a3aab894d9	kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() [ Upstream commit d71ba1649fa3c464c51ec7163e4b817345bff2c7 ] kthread_mod_delayed_work() might race with kthread_cancel_delayed_work_sync() or another kthread_mod_delayed_work() call. The function lets the other operation win when it sees work->canceling counter set. And it returns @false. But it should return @true as it is done by the related workqueue API, see mod_delayed_work_on(). The reason is that the return value might be used for reference counting. It has to distinguish the case when the number of queued works has changed or stayed the same. The change is safe. kthread_mod_delayed_work() return value is not checked anywhere at the moment. Link: https://lore.kernel.org/r/20210521163526.GA17916@redhat.com Link: https://lkml.kernel.org/r/20210610133051.15337-4-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:19 +02:00
Paul E. McKenney	dba9cda5aa	clocksource: Retry clock read if long delays detected [ Upstream commit db3a34e17433de2390eb80d436970edcebd0ca3e ] When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:18 +02:00
Boqun Feng	2ef6cd6e48	lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage() [ Upstream commit 7b1f8c6179769af6ffa055e1169610b51d71edd5 ] In the step #3 of check_irq_usage(), we seach backwards to find a lock whose usage conflicts the usage of @target_entry1 on safe/unsafe. However, we should only keep the irq-unsafe usage of @target_entry1 into consideration, because it could be a case where a lock is hardirq-unsafe but soft-safe, and in check_irq_usage() we find it because its hardirq-unsafe could result into a hardirq-safe-unsafe deadlock, but currently since we don't filter out the other usage bits, so we may find a lock dependency path softirq-unsafe -> softirq-safe, which in fact doesn't cause a deadlock. And this may cause misleading lockdep splats. Fix this by only keeping LOCKF_ENABLED_IRQ_ALL bits when we try the backwards search. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210618170110.3699115-4-boqun.feng@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:15 +02:00
Boqun Feng	1b45a85262	locking/lockdep: Fix the dep path printing for backwards BFS [ Upstream commit 69c7a5fb2482636f525f016c8333fdb9111ecb9d ] We use the same code to print backwards lock dependency path as the forwards lock dependency path, and this could result into incorrect printing because for a backwards lock_list ->trace is not the call trace where the lock of ->class is acquired. Fix this by introducing a separate function on printing the backwards dependency path. Also add a few comments about the printing while we are at it. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210618170110.3699115-2-boqun.feng@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:15 +02:00
Odin Ugedal	432188f626	sched/fair: Fix ascii art by relpacing tabs [ Upstream commit 08f7c2f4d0e9f4283f5796b8168044c034a1bfcb ] When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-07-14 16:53:12 +02:00
Steven Rostedt (VMware)	c65755f595	tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing commit 9913d5745bd720c4266805c8d29952a3702e4eca upstream. All internal use cases for tracepoint_probe_register() is set to not ever be called with the same function and data. If it is, it is considered a bug, as that means the accounting of handling tracepoints is corrupted. If the function and data for a tracepoint is already registered when tracepoint_probe_register() is called, it will call WARN_ON_ONCE() and return with EEXISTS. The BPF system call can end up calling tracepoint_probe_register() with the same data, which now means that this can trigger the warning because of a user space process. As WARN_ON_ONCE() should not be called because user space called a system call with bad data, there needs to be a way to register a tracepoint without triggering a warning. Enter tracepoint_probe_register_may_exist(), which can be called, but will not cause a WARN_ON() if the probe already exists. It will still error out with EEXIST, which will then be sent to the user space that performed the BPF system call. This keeps the previous testing for issues with other users of the tracepoint code, while letting BPF call it with duplicated data and not warn about it. Link: https://lore.kernel.org/lkml/20210626135845.4080-1-penguin-kernel@I-love.SAKURA.ne.jp/ Link: https://syzkaller.appspot.com/bug?id=41f4318cf01762389f4d1c1c459da4f542fe5153 Cc: stable@vger.kernel.org Fixes: `c4f6699dfc` ("bpf: introduce BPF_RAW_TRACEPOINT") Reported-by: syzbot <syzbot+721aa903751db87aa244@syzkaller.appspotmail.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot+721aa903751db87aa244@syzkaller.appspotmail.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-14 16:53:08 +02:00
Steven Rostedt (VMware)	acf8494ba5	tracing/histograms: Fix parsing of "sym-offset" modifier commit 26c563731056c3ee66f91106c3078a8c36bb7a9e upstream. With the addition of simple mathematical operations (plus and minus), the parsing of the "sym-offset" modifier broke, as it took the '-' part of the "sym-offset" as a minus, and tried to break it up into a mathematical operation of "field.sym - offset", in which case it failed to parse (unless the event had a field called "offset"). Both .sym and .sym-offset modifiers should not be entered into mathematical calculations anyway. If ".sym-offset" is found in the modifier, then simply make it not an operation that can be calculated on. Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: stable@vger.kernel.org Fixes: `100719dcef` ("tracing: Add simple expression support to hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-07-14 16:53:07 +02:00
Andrey Zhizhikin	3aef3ef268	Linux 5.4.129 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcbxkACgkQ3qZv95d3 LNxZMBAArNPLhVYdEDDFosb6Y/5RGjjZ/79OGHH0p5YiTo8D+wBHi+wXRl5Jp0PA 3YVVU8lDTbeDm7E7uWeduWjFwEpsPBL8395scbhC6VR3PfnyunjarVXZgi6EHnMl p6HjXXtQ1jTrdDSziGDIhZVQT5FGb2/MMx9m69mfi5BTLjGfWy8chHFbC2GZszlp Znu9syjisUBbc4I4XHFgXw0hoQSSig6SUTZCrdTpIW/PZ0swfl8ZPxREh0CZNMpw Y2orRt+oHlkWPw1/sSkoTE1PRvXwNWFXyw5caOu846jAfhKtxO54SsqJqhM7VLHZ pdH4eb6q7AFyt0A62HkIqa5oabs5Vk9G24b8m5ggc2F/UTkHqgwUcMCud0d3DYL0 Q7OEAmThQzHHKJ+CeNRJLsiKqVBNHmeS24B+ELldlAiX22vLr9pUsIb342Au1ZjR S3BTnneAbYGBv4qUoV2yUF9wQ/LxsFMSl/vmjCBOxg7c3LbKYChUwskYnvd6EwWj ObCyLU6FK9HWXSBSp/X+irlF1CLla+HuOC+Aej2U5a8DtmHId4LHMeq/XOxZ9s/8 QUoX4rh5P+TJ8PIiTqXKrQo5rnR79MiYssIhUozKTdt9ZoMtXzI4mVLXN/yzAVD9 v4aWYx8m2x17Wq+ptaLMSTSed4m3c25uEl4MucLBmKQV8ClAxW8= =Sijo -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDchOoACgkQ7G51OISz Hs3s9Q/+OfeIcTzOY6qDX53PSxK5iF3xZEiPR5HOSVu1cY4+1tyt0bkYdQJ2N//d NH6YCuOsVI5chRxKN8b6I01U6xxQ0VNvj5RYmuC0AqhhzJBpfhh6FrrGuVxP+dCj aa2twfRZO+Y0yzrIYdSvYdRhJlEhMbKGNyHu871tyQQTlp0eQKy9lBhcCuVSCMso p64lRlK35GDi6TaWQ7R0SYOziWWSHByf4p3h4ZlpyhB7w6yMNjNrts4qSoZb2uOY I453WroS7tWV3qkOZ9FeqjHyg4mjBg87/wSMZ0DbBANinlQZW5YTOjE/WyPossL3 C4jTP8NBLsng+ZfhAaMRfbvxGj5gbY0ghSOmsuglAOQDd+tlTfB23pkgHFlj6aie FUplJnTLd8VXPWkoNVKH6GfG2rZAUtQKM4EWDPt2KK3Dch9W3YH1bHL9EyxR0G7f aDBHM9egTyPOYNcfeVDj0KEJTrOxkmCcYz83D+eEDGKHi//Gcb+k9Gsg6q8X6Yxa ejFx5RBM5SW57LA70ZK3+HWhqfqcILYJT7Lxoue7bH+15vSGnTkmJKS80OUrP8w8 y9gQ9WNsmjQUHvtKPAIZUZYlUS05FhtQ+faC8XyrLnfZj+ZCGBZor/KsNe0SP/UN bAYudfCn0YkjGfTNcljOe4oAnwmnVKAjGlQkHOFw+llmM0czEBk= =vpHo -----END PGP SIGNATURE----- Merge tag 'v5.4.129' into 5.4-2.3.x-imx Linux 5.4.129 Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-06-30 14:51:20 +00:00
Hugh Dickins	61168eafe0	mm, futex: fix shared futex pgoff on shmem huge page [ Upstream commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 ] If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit `13d60f4b6a` ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: `800d8c63b2` ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Note on stable backport: leave redundant #include <linux/hugetlb.h> in kernel/futex.c, to avoid conflict over the header files included. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-30 08:47:55 -04:00
Petr Mladek	42f11f0fe9	kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() commit 5fa54346caf67b4b1b10b1f390316ae466da4d53 upstream. The system might hang with the following backtrace: schedule+0x80/0x100 schedule_timeout+0x48/0x138 wait_for_common+0xa4/0x134 wait_for_completion+0x1c/0x2c kthread_flush_work+0x114/0x1cc kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144 kthread_cancel_delayed_work_sync+0x18/0x2c xxxx_pm_notify+0xb0/0xd8 blocking_notifier_call_chain_robust+0x80/0x194 pm_notifier_call_chain_robust+0x28/0x4c suspend_prepare+0x40/0x260 enter_state+0x80/0x3f4 pm_suspend+0x60/0xdc state_store+0x108/0x144 kobj_attr_store+0x38/0x88 sysfs_kf_write+0x64/0xc0 kernfs_fop_write_iter+0x108/0x1d0 vfs_write+0x2f4/0x368 ksys_write+0x7c/0xec It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(): CPU0 CPU1 Context: Thread A Context: Thread B kthread_mod_delayed_work() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() kthread_cancel_delayed_work_sync() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() spin_lock() work->canceling++ spin_unlock spin_lock() queue_delayed_work() // dwork is put into the worker->delayed_work_list spin_unlock() kthread_flush_work() // flush_work is put at the tail of the dwork wait_for_completion() Context: IRQ kthread_delayed_work_timer_fn() spin_lock() list_del_init(&work->node); spin_unlock() BANG: flush_work is not longer linked and will never get proceed. The problem is that kthread_mod_delayed_work() checks work->canceling flag before canceling the timer. A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay. The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced. The proper solution is that kthread_mod_delayed_work() will remove the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped and the remaining operations can be done under worker->lock. Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock. Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com Fixes: `9a6b06c8d9` ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Martin Liu <liumartin@google.com> Cc: <jenhaochen@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-30 08:47:51 -04:00
Petr Mladek	06ab015d18	kthread_worker: split code for canceling the delayed work timer commit 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 upstream. Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-30 08:47:51 -04:00
Mimi Zohar	e2dc07ca4e	module: limit enabling module.sig_enforce [ Upstream commit 0c18f29aae7ce3dadd26d8ee3505d07cc982df75 ] Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying "module.sig_enforce=1" on the boot command line sets "sig_enforce". Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured. This patch makes the presence of /sys/module/module/parameters/sig_enforce dependent on CONFIG_MODULE_SIG=y. Fixes: `fda784e50a` ("module: export module signature enforcement status") Reported-by: Nayna Jain <nayna@linux.ibm.com> Tested-by: Mimi Zohar <zohar@linux.ibm.com> Tested-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-06-30 08:47:42 -04:00
Andrey Zhizhikin	db8ff65069	This is the 5.4.128 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDTLA4ACgkQONu9yGCS aT45xg/+IvxFaIOtutEBkFCJvEurRWSozjBKAfX9xtJQGSSKVyDvh7GZWfEXMxZc oNf8DWQKvaiZj2mRdgYp6Ilo27Ps6aN3vCo09z+U3mfGQLMbNpPYEvSq6Twl26NB 8lL8b++0Jo7P+eOALHohBS125/E0etqhoc2HXDFp6pfksj6J7klxlyQ2NX9Ih8xm l7Cto5flCHM9g20/CNsqxXPWiuBKnzSvp9YH9HMDgjOV6YSktLGTHAJ8omjPm0V/ pQVFOo4Kyx34exdA/IzrM/yV4iDThVtwL6+bNErWtl6LwiIcNK3esARYTNjbBBhK W156adxp6kl6LqMADr/y77WqvcH6H2PhpRnMj+6t21FpK7cTbXfqvxBfpOvE1Buh in95LJN1Iins1PTozBVHcUIpdESO5AN8/2aHq0LRLmVbaLlo6aj+sjdHNPvf7HwW 8LDHtpGNao/spMuZmvvH+6i3iwuciINCRY9TVBDgkT5LhWhRHBl6+uSLEX/d+s3Z 663Q6HPu+cfubR7UC8+QsMMtf7KD2yvQuadAz6n/Z41vvSYIUHPGsYtZUmsef3jP n4CTAmGavtyR5jaQNkuw8nnIn7cthONw94foFheBH0doxmkXPKcwqmWO9DH77n58 unMT31ArVg9ObrO/YmLjEaV9X7VlfRf6yw7tey1RJXgrSD3nwgk= =9+GF -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDTTjIACgkQ7G51OISz Hs1mJw/5AaIlIv+kMC0qP9aG5Y1uoUQ+fewr4Ofo9Kt8QlVcogdeM16URF7O2CDG mICWVXRqY+xWl/84XWJs11RRxfMnXp85myi1jyalOFDYojJp2qOkOyEi8zMg7wui N1kkPxEworm6ILrZRKNgT36ZfSEGAD2ogkp4wS/Fldwh2RNHtGUSw3xAPcqg0yDq +Ve3V06MDul/dyRup3MYtQ/Kxb5wCc+Lfwh+eXYZIu/DfXy3E4K3w8YB7NPazlTw dBlWU/S3Y0uyoq+DvfFQ+nL0GTe0ezGQB5n5n1eNRWg+mEn3IiWXkrBJfha2Swcl vv+Fy/5ro9elcn1oQF5uvwDoGHhq/KkR7TjRICY9zmVvJ79IcqZJ/Bdg5CEXM+xc EvZDAdXn26y2xKZJzKGQSe7QJurOuQpdW497h1DLGIhDs2sGDjhXuISPcNpjPJKG PdgpJ9kG4v2cYnW58UTdzQKNFg7jjFUiiQV/IIoiVlxhg7dVQnSKCgXHcngiZiuQ eUR9p7ahfEOlG2N5qTlJqom0HtiJ6I4a5TDQhXoEYlKfxr9711f2+cramhf2cle+ +G64chvmKcv7Fyx5yIwSJrTJfi6vsuW5VMfXrvGIqjFUEjIGe9rQMhWH34A7s5tG RD0Tg90JJr0rNsNa+LZNPbRynbdPBbrVWdrVyXN+9HDs40oe8BY= =f7qZ -----END PGP SIGNATURE----- Merge tag 'v5.4.128' into 5.4-2.3.x-imx This is the 5.4.128 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-06-23 15:07:27 +00:00
Steven Rostedt (VMware)	c7660ab812	tracing: Do no increment trace_clock_global() by one commit 89529d8b8f8daf92d9979382b8d2eb39966846ea upstream. The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit aafe104aa9096 was added. Cc: stable@vger.kernel.org Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-23 14:41:28 +02:00
Steven Rostedt (VMware)	79894a5d75	tracing: Do not stop recording comms if the trace file is being read commit 4fdd595e4f9a1ff6d93ec702eaecae451cfc6591 upstream. A while ago, when the "trace" file was opened, tracing was stopped, and code was added to stop recording the comms to saved_cmdlines, for mapping of the pids to the task name. Code has been added that only records the comm if a trace event occurred, and there's no reason to not trace it if the trace file is opened. Cc: stable@vger.kernel.org Fixes: `7ffbd48d5c` ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-23 14:41:28 +02:00
Steven Rostedt (VMware)	4ab1152bb7	tracing: Do not stop recording cmdlines when tracing is off commit 85550c83da421fb12dc1816c45012e1e638d2b38 upstream. The saved_cmdlines is used to map pids to the task name, such that the output of the tracing does not just show pids, but also gives a human readable name for the task. If the name is not mapped, the output looks like this: <...>-1316 [005] ...2 132.044039: ... Instead of this: gnome-shell-1316 [005] ...2 132.044039: ... The names are updated when tracing is running, but are skipped if tracing is stopped. Unfortunately, this stops the recording of the names if the top level tracer is stopped, and not if there's other tracers active. The recording of a name only happens when a new event is written into a ring buffer, so there is no need to test if tracing is on or not. If tracing is off, then no event is written and no need to test if tracing is off or not. Remove the check, as it hides the names of tasks for events in the instance buffers. Cc: stable@vger.kernel.org Fixes: `7ffbd48d5c` ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-23 14:41:28 +02:00
Andrey Zhizhikin	5134d8a627	This is the 5.4.126 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDJy8EACgkQONu9yGCS aT4iaBAAxDY3xDhlTee0ajZwowSVBgjGeVuVFkoKFHNFHu190lXikaiaULlkcfot MPPjUJMH4jM+i7eUC4LWKbxgzphMyAQszAZSd7MWNzgalOids0D77VqZXwEaZIqI wam6/vd9lIeJNf/H8kinm+SDozuv0MkECSHCquTaivvCTYTMK9qKwUPk+E8h8t2o dJ4kJJY24mDRp6F+jnY0nR1CaHXSltxlMG8Viy8HJLsLKe7cRGhkXzCdLnf1m5ea ppYiaLdsD1prlrNDnXWhd8zzRYh4AEnTpEMZuFt4U0oQ7L7D0mCZHw/13SyP21z8 dzE54J4d368RZ0tT5XgiMmxd+4eBYo6Xmlj3+DmADUUHeSddXuJcLLlzDUWgHak2 5z+eqS8yh1lU2NFcMCqcgia1XBx+R1n+Ibt8IF/nyh8/PhdQ22wTtl+lfvR/G1pb w6GIoFB/jvuQ+x/RoWcmaeaqu0TKlIbtqwKOgcTNGtNTh7H7V5sDHr9VtHK5hKVE 6Jdyq+lmIgYIV2wCV3vmfLLlr0NIEDmV7NET/q0VgyWhp+kNSBVYivFeKFGMQba3 ES5LseCUwTzpCv7uFO48gQK+0v3lqmkgWjU1tFDGZKCh9Dlj6eNrwntYyIJPk2hx Dh5u3nrOeRuVYRP5W6ejWOTDDA12+diwtSMBM47vtHMYr+gBUXk= =JTCj -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmDKAVwACgkQ7G51OISz Hs372A//cLcSXeASJjtYGAEqf9Yj0ZMHrY7NSb52u6/oz95VP3zVlxVAsyxhHcaN jXlcne5V5oTx86YI0QNs/wGEO/ErwhRFlN6t8NX0bmBBOj9PrvdguJlA4HSVk4xs wdBDgLtWwAKfv508KXfsX58le7oGszXst3sb9ovTlWwb56Ux4E9nIvWwvdao2Z3v hfs8uzpkhmgEHdRvrFstpxJH7ewmpG24Pr6extkyk4gfXAXVfa7/nOsRceVbMZ8K ln9Yam6MaDB4HH29enXVhK+b9rL7UfM4wwW5JSPlXhH2REfYze+r2HYorN2Bm8mR snClc+NlrpHyE8oFeydu3ZX1caXMnM07EaG6l/l33Tot8PczaTOeSZOezHFEGFAY juOE0YgqxemlNCCJUKNXfdkkPhd5wfrlzJjbNv+84hahUUAT5ET4HmuSUcTzvPkv tl61nmvsRzx635YNs5P7Z0Q5sWwxWKAMoTRM8+0DA1Uov+Hevuy2EHtg01YIsMAc E4RloWALKgbJ/yYG/hlx/xG6JAu8dMkFoaG7d6xCpNu/5oX3QnnP5EgapOQio/fr HZKO96hGdUTbwNkgbtnN6djGIcEsgnCc8JYg6h75ROalB+HHVszynKW97AwaAWTQ MV6oT1eYAJ1vjrP+XqH0ibkV10tC9Vrxs/hEeO6r7vvlVS0zytE= =jJ3r -----END PGP SIGNATURE----- Merge tag 'v5.4.126' into 5.4-2.3.x-imx This is the 5.4.126 stable release Conflicts: - drivers/usb/cdns3/gadget.c: Skip upstream commit f0509160f25e3 ("usb: cdns3: Fix runtime PM imbalance on error") as the implementation is not present in the NXP tree to apply it. Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-06-16 13:49:01 +00:00
Liangyan	d63f00ec90	tracing: Correct the length check which causes memory corruption commit 3e08a9f9760f4a70d633c328a76408e62d6f80a3 upstream. We've suffered from severe kernel crashes due to memory corruption on our production environment, like, Call Trace: [1640542.554277] general protection fault: 0000 [#1] SMP PTI [1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G [1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190 [1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286 [1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX: 0000000006e931bf [1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI: ffff9a45ff004300 [1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09: 0000000000000000 [1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9a20608d [1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15: 696c662f65636976 [1640542.563128] FS: 00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000) knlGS:0000000000000000 [1640542.563937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4: 00000000003606e0 [1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1640542.566742] Call Trace: [1640542.567009] anon_vma_clone+0x5d/0x170 [1640542.567417] __split_vma+0x91/0x1a0 [1640542.567777] do_munmap+0x2c6/0x320 [1640542.568128] vm_munmap+0x54/0x70 [1640542.569990] __x64_sys_munmap+0x22/0x30 [1640542.572005] do_syscall_64+0x5b/0x1b0 [1640542.573724] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [1640542.575642] RIP: 0033:0x7f45d6e61e27 James Wang has reproduced it stably on the latest 4.19 LTS. After some debugging, we finally proved that it's due to ftrace buffer out-of-bound access using a debug tool as follows: [ 86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000 [ 86.780806] no_context+0xdf/0x3c0 [ 86.784327] __do_page_fault+0x252/0x470 [ 86.788367] do_page_fault+0x32/0x140 [ 86.792145] page_fault+0x1e/0x30 [ 86.795576] strncpy_from_unsafe+0x66/0xb0 [ 86.799789] fetch_memory_string+0x25/0x40 [ 86.804002] fetch_deref_string+0x51/0x60 [ 86.808134] kprobe_trace_func+0x32d/0x3a0 [ 86.812347] kprobe_dispatcher+0x45/0x50 [ 86.816385] kprobe_ftrace_handler+0x90/0xf0 [ 86.820779] ftrace_ops_assist_func+0xa1/0x140 [ 86.825340] 0xffffffffc00750bf [ 86.828603] do_sys_open+0x5/0x1f0 [ 86.832124] do_syscall_64+0x5b/0x1b0 [ 86.835900] entry_SYSCALL_64_after_hwframe+0x44/0xa9 commit b220c049d519 ("tracing: Check length before giving out the filter buffer") adds length check to protect trace data overflow introduced in `0fc1b09ff1`, seems that this fix can't prevent overflow entirely, the length check should also take the sizeof entry->array[0] into account, since this array[0] is filled the length of trace data and occupy addtional space and risk overflow. Link: https://lkml.kernel.org/r/20210607125734.1770447-1-liangyan.peng@linux.alibaba.com Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Xunlei Pang <xlpang@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: b220c049d519 ("tracing: Check length before giving out the filter buffer") Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: yinbinbin <yinbinbin@alibabacloud.com> Reviewed-by: Wetp Zhang <wetp.zy@linux.alibaba.com> Tested-by: James Wang <jnwang@linux.alibaba.com> Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-16 11:59:46 +02:00
Steven Rostedt (VMware)	7e4e824b10	ftrace: Do not blindly read the ip address in ftrace_bug() commit 6c14133d2d3f768e0a35128faac8aa6ed4815051 upstream. It was reported that a bug on arm64 caused a bad ip address to be used for updating into a nop in ftrace_init(), but the error path (rightfully) returned -EINVAL and not -EFAULT, as the bug caused more than one error to occur. But because -EINVAL was returned, the ftrace_bug() tried to report what was at the location of the ip address, and read it directly. This caused the machine to panic, as the ip was not pointing to a valid memory address. Instead, read the ip address with copy_from_kernel_nofault() to safely access the memory, and if it faults, report that the address faulted, otherwise report what was in that location. Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/ Cc: stable@vger.kernel.org Fixes: `05736a427f` ("ftrace: warn on failure to disable mcount callers") Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-16 11:59:45 +02:00
Vincent Guittot	71c751cbb9	sched/fair: Make sure to update tg contrib for blocked load commit 02da26ad5ed6ea8680e5d01f20661439611ed776 upstream. During the update of fair blocked load (__update_blocked_fair()), we update the contribution of the cfs in tg->load_avg if cfs_rq's pelt has decayed. Nevertheless, the pelt values of a cfs_rq could have been recently updated while propagating the change of a child. In this case, cfs_rq's pelt will not decayed because it has already been updated and we don't update tg->load_avg. __update_blocked_fair ... for_each_leaf_cfs_rq_safe: child cfs_rq update cfs_rq_load_avg() for child cfs_rq ... update_load_avg(cfs_rq_of(se), se, 0) ... update cfs_rq_load_avg() for parent cfs_rq -propagation of child's load makes parent cfs_rq->load_sum becoming null -UPDATE_TG is not set so it doesn't update parent cfs_rq->tg_load_avg_contrib .. for_each_leaf_cfs_rq_safe: parent cfs_rq update cfs_rq_load_avg() for parent cfs_rq - nothing to do because parent cfs_rq has already been updated recently so cfs_rq->tg_load_avg_contrib is not updated ... parent cfs_rq is decayed list_del_leaf_cfs_rq parent cfs_rq - but it still contibutes to tg->load_avg we must set UPDATE_TG flags when propagting pending load to the parent Fixes: `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-3-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-16 11:59:44 +02:00
Marco Elver	26ab08df86	perf: Fix data race between pin_count increment/decrement commit 6c605f8371159432ec61cbb1488dcf7ad24ad19a upstream. KCSAN reports a data race between increment and decrement of pin_count: write to 0xffff888237c2d4e0 of 4 bytes by task 15740 on cpu 1: find_get_context kernel/events/core.c:4617 __do_sys_perf_event_open kernel/events/core.c:12097 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... read to 0xffff888237c2d4e0 of 4 bytes by task 15743 on cpu 0: perf_unpin_context kernel/events/core.c:1525 [inline] __do_sys_perf_event_open kernel/events/core.c:12328 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... Because neither read-modify-write here is atomic, this can lead to one of the operations being lost, resulting in an inconsistent pin_count. Fix it by adding the missing locking in the CPU-event case. Fixes: `fe4b04fa31` ("perf: Cure task_oncpu_function_call() races") Reported-by: syzbot+142c9018f5962db69c7e@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210527104711.2671610-1-elver@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-16 11:59:44 +02:00
Alexander Kuznetsov	4d14a82ef1	cgroup1: don't allow '\n' in renaming commit b7e24eb1caa5f8da20d405d262dba67943aedc42 upstream. cgroup_mkdir() have restriction on newline usage in names: $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2' mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument But in cgroup1_rename() such check is missed. This allows us to make /proc/<pid>/cgroup unparsable: $ mkdir /sys/fs/cgroup/cpu/test $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2' $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2' $ cat /proc/self/cgroup 11:pids:/ 10:freezer:/ 9:hugetlb:/ 8:cpuset:/ 7:blkio:/user.slice 6:memory:/user.slice 5:net_cls,net_prio:/ 4:perf_event:/ 3:devices:/user.slice 2:cpu,cpuacct:/test test2 1:name=systemd:/ 0::/ Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-06-16 11:59:40 +02:00
Sergey Senozhatsky	f9e7a38d14	wq: handle VM suspension in stall detection [ Upstream commit 940d71c6462e8151c78f28e4919aa8882ff2054e ] If VCPU is suspended (VM suspend) in wq_watchdog_timer_fn() then once this VCPU resumes it will see the new jiffies value, while it may take a while before IRQ detects PVCLOCK_GUEST_STOPPED on this VCPU and updates all the watchdogs via pvclock_touch_watchdogs(). There is a small chance of misreported WQ stalls in the meantime, because new jiffies is time_after() old 'ts + thresh'. wq_watchdog_timer_fn() { for_each_pool(pool, pi) { if (time_after(jiffies, ts + thresh)) { pr_emerg("BUG: workqueue lockup - pool"); } } } Save jiffies at the beginning of this function and use that value for stall detection. If VM gets suspended then we continue using "old" jiffies value and old WQ touch timestamps. If IRQ at some point restarts the stall detection cycle (pvclock_touch_watchdogs()) then old jiffies will always be before new 'ts + thresh'. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-06-16 11:59:35 +02:00
Shakeel Butt	92215c1f24	cgroup: disable controllers at parse time [ Upstream commit 45e1ba40837ac2f6f4d4716bddb8d44bd7e4a251 ] This patch effectively reverts the commit `a3e72739b7` ("cgroup: fix too early usage of static_branch_disable()"). The commit `6041186a32` ("init: initialize jump labels before command line option parsing") has moved the jump_label_init() before parse_args() which has made the commit `a3e72739b7` unnecessary. On the other hand there are consequences of disabling the controllers later as there are subsystems doing the controller checks for different decisions. One such incident is reported [1] regarding the memory controller and its impact on memory reclaim code. [1] https://lore.kernel.org/linux-mm/921e53f3-4b13-aab8-4a9e-e83ff15371e4@nec.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: NOMURA JUNICHI(野村　淳一) <junichi.nomura@nec.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Jun'ichi Nomura <junichi.nomura@nec.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-06-16 11:59:35 +02:00
Andrey Zhizhikin	e3ec85be6f	This is the 5.4.123 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCw0KEACgkQONu9yGCS aT4TxxAAvW/xsJKApUuCTrrdY/YMj96bFT6dcF/17N36Tb+lNRL+Ig+dYlfammlD sp7enkerXrIU+27ngnsJfznJLDXOCUPw77p9H90Kzbo7k3wi7Pt2nqFRyuOwYLRN CLo3gmWNWBReCjYmUK5uzUlGeUnQOtD7pcEMexOVxgMmRsJ1uKU2UGKHCIa8VxS1 /xf8mQJv0lIbm8GLSYzCsP3zYafzc0cY0ST6CvCCX+71ryfKXGTtH0wAUvdvWsTs yvpc7UhFcnW9t2e77w9DmyQRf/nDfsu/ueXqITkJot4fYshA0w9owsNxKdnq/DKV nfIVK/agISYeT44uWOqb9TBUdEz2nWA31eIrXD6/Tf4ZUVw2WQy/UD8l5gUt+G7s Ag4nKSEZP4BAsZbnwnYPUm/qsfLm5l2+Spj4IK6mA7TZKMvjgLiHJIxihYzfeuof q/E9NeH8ejfrO6tnMyTUZzmPRjMImoeAvg+NaldpDlSyDE+DG+l9iIjO66Lg0uif dupHtAXFEjgfqHIRfvq4lW6lRKmUwhUkw0mdGgOSbgjoTktJNdrAVoQgsOIvhYBJ fuhs9PTGT5isekDEotdsj8wuXlVKQ2rMKl98Z1djZXA73INobftgbtFxaVGIhLj/ 7LK6pbnaClT5bHJBi+AvVExOZolPsCxgxaXsESCV4nxppEOjRxA= =+GH7 -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCz5/oACgkQ7G51OISz Hs3hTg/9H2jbISAaXYNpwhHnBJa4eyinrWYBsZ5r1OlUjc00N6uge8VMzEDAjOO8 Ep8cDwCKJ3nz+h/xwN8D5ERqTCjliTEB88FPxgW5iebn4uTYNLz10rPCVbHu7M6g bK8eNnBCQKFgaRILlG7G6C/5Radm56MoaNV5izfrWjr8xJ3LJEusSkAetl820p3Z FYFLZMPvbe1C5J3wPF85a9WdUHu1lY89vNmac30y9+W4V0ogDHKUX/9gIZTKVNLK G6MkjaOkKscB++sBnLeB/4UT4H/2lKmIY8Q/Lf4jUjm5Pd8DLR9SXxNzyTvaenH7 LkX7/vzxmjrurDyhrd2qk9pMhjs9VvutXuQFHWHMZgDfo3PLpc/TBaWfTCdj/VYR pETYsgiGVJKWw8kDBSlNgooMVuozqj2mAXkYK2ghvjM+975A+rZNobsWCRnU/rER xHiII8br11jFXb8b2EZ4BDCfkQgubsuQmiXKbxVjD11HY6Z7zErAb9UGNuCjKlz3 NxvXtpvjpNG+c95OW0WwgRM9iNCkEwN7GsjhHai9Rd6+olVflzc2r4gVn6sSqVhZ CD2ukwRUEtVxJ3xbd2vZnfp735aB/+UHBm8jQYxIORCNLoIVs9tIGlkeIK92W2A7 oU0PcIwI40DxOwc4a6Ybppx0ePUl6CCxlkn2m3HX+y73CcinS1g= =mDdc -----END PGP SIGNATURE----- Merge tag 'v5.4.123' into 5.4-2.3.x-imx This is the 5.4.123 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-30 19:31:03 +00:00
Daniel Borkmann	3173c7c807	bpf: No need to simulate speculative domain for immediates commit a7036191277f9fa68d92f2071ddc38c09b1e5ee5 upstream. In 801c6058d14a ("bpf: Fix leakage of uninitialized bpf stack under speculation") we replaced masking logic with direct loads of immediates if the register is a known constant. Given in this case we do not apply any masking, there is also no reason for the operation to be truncated under the speculative domain. Therefore, there is also zero reason for the verifier to branch-off and simulate this case, it only needs to do it for unknown but bounded scalars. As a side-effect, this also enables few test cases that were previously rejected due to simulation under zero truncation. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-28 13:10:26 +02:00
Daniel Borkmann	2b3cc41d50	bpf: Fix mask direction swap upon off reg sign change commit bb01a1bba579b4b1c5566af24d95f1767859771e upstream. Masking direction as indicated via mask_to_left is considered to be calculated once and then used to derive pointer limits. Thus, this needs to be placed into bpf_sanitize_info instead so we can pass it to sanitize_ptr_alu() call after the pointer move. Piotr noticed a corner case where the off reg causes masking direction change which then results in an incorrect final aux->alu_limit. Fixes: 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask") Reported-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-28 13:10:26 +02:00
Daniel Borkmann	2768f99622	bpf: Wrap aux data inside bpf_sanitize_info container commit 3d0220f6861d713213b015b582e9f21e5b28d2e0 upstream. Add a container structure struct bpf_sanitize_info which holds the current aux info, and update call-sites to sanitize_ptr_alu() to pass it in. This is needed for passing in additional state later on. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-28 13:10:26 +02:00
Andrey Zhizhikin	d34d22f869	This is the 5.4.122 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCuHYEACgkQONu9yGCS aT6VkxAAr0kISsNHDXB1tDLsOiPsl+hvFQS7DNw8LRpBN8deqiOMrK2OGnXygAWE q6554BmS7sxJ9oSu4fL+fJpuSTdMPsDEGh5qFFXqdb0wjxw9QaHK1SPbLO0QRMmt OPH9tHtNKY9Udiu1ZXj/HHWkUf41VjBwIUpa1riP1ht7WCPPFAF0yeUjEMDB3ZNe m8CkSDWS6NqFQxQcYBWLTVufVSVyu+MkJS0t50KDQEZFv/12pSRllkJ3M/RdBxNV hvQTIFJr/3jPmk9Q5Vt0ZG2mKCtObYcboDxs5tfKVd03uErMqcchFMpL7DGXBBFx S77URkYra6nJvJJB533SiWYR3zKcihnl8eMmV4NqCTgR+pjf2G7MMEqxJhADvhGu wg5IGqMJID2p7nlkPZtod4pap3VY1zkotKdeTjUm6URnf5G9JkgdvqTUsCPQPuEm WIlEqziZSZxy3bj8mm88116+TyDDb7b9Hu0rz3qYYDOBon2r0uZ+SyfeSD76csnS ncEr2XVSlV12g2WQP/zB+ypLQ8YDJpYcyhAdNS2VQIFgjSxODBUEb76zYYqNHTQC PUrztFbbwJ/iH/SXQjzuRsRB3x4XwNCmRGwTMXTaZYot8ui9Ka4gDY+mZ8T0uTS8 68sFCzb+M+zQf1i72s6Vp9dz5msymSbDQbIf79fJ3lbDA4oVXnA= =vILd -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCuZ9cACgkQ7G51OISz Hs1wVBAAgGPvST+l20L4AS/D3NtPX8kwnk3JAZJ0YJjSPHfCkSnGmrjNc7PNmbxy uYXOlbUfscJU03Erski3VbSlqLHK7XVr3d4nIVY3ZISjL7iHa6xaRa5SONrwQHvT xUW9OcGgwxLdbBluLi8RmqZH0xf7Ds6uuQRZT5YN35/st7Qj5sKxKQc6IyTHceYb NklevN9xuQGC5PfPeWInRw1y0viSeANRwPlINwLif7vsAVgHoR4ju3Zie/mCIrnI 6GMgMSbJWxWUZ7y2BEgTWRrNatA/eS6KG3Q0vjSzo42XKTTyl673Gq7GqSPqNzyY yjN5MJyi4G1jjOdkFWLZCRjXRssq6BhWJMbaXKYVm/vAN+oqecoaA0xBzUr80z6w ZEqM2j+f7hUwtjIiduDxdApjeaPXnMDINvdPMK2qLgnHDIJobaTLQODdvRRSeK5e LdxLcwnHno08EBeXUO2e2ttTzVolM0p4gr88AsR3Sdnu/HqJxND4bQiot9Nwywo+ 1d247qUjQjx4d3MspJ5Wn3zOE/sEnH4NgNepkUM/lJMwku5P7Dz9WGstPwSbAF4D /19k/c97jUi3cnJIsFBbwEx+bQFR6CAnVIZm/SvsOpOwasF6TcguFjEfdDO2DiV+ VSClDYusmRUSexvRX0kiigZ9JcyBH4y4XnmAe52QypwI/uupvAU= =Lxcg -----END PGP SIGNATURE----- Merge tag 'v5.4.122' into 5.4-2.3.x-imx This is the 5.4.122 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-26 15:23:00 +00:00
Zqiang	845c2b9d99	locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal [ Upstream commit 3a010c493271f04578b133de977e0e5dd2848cea ] When a interruptible mutex locker is interrupted by a signal without acquiring this lock and removed from the wait queue. if the mutex isn't contended enough to have a waiter put into the wait queue again, the setting of the WAITER bit will force mutex locker to go into the slowpath to acquire the lock every time, so if the wait queue is empty, the WAITER bit need to be clear. Fixes: `040a0a3710` ("mutex: Add support for wound/wait style locks") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zqiang <qiang.zhang@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210517034005.30828-1-qiang.zhang@windriver.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-26 12:05:15 +02:00
Oleg Nesterov	670d34d543	ptrace: make ptrace() fail if the tracee changed its pid unexpectedly [ Upstream commit dbb5afad100a828c97e012c6106566d99f041db6 ] Suppose we have 2 threads, the group-leader L and a sub-theread T, both parked in ptrace_stop(). Debugger tries to resume both threads and does ptrace(PTRACE_CONT, T); ptrace(PTRACE_CONT, L); If the sub-thread T execs in between, the 2nd PTRACE_CONT doesn not resume the old leader L, it resumes the post-exec thread T which was actually now stopped in PTHREAD_EVENT_EXEC. In this case the PTHREAD_EVENT_EXEC event is lost, and the tracer can't know that the tracee changed its pid. This patch makes ptrace() fail in this case until debugger does wait() and consumes PTHREAD_EVENT_EXEC which reports old_pid. This affects all ptrace requests except the "asynchronous" PTRACE_INTERRUPT/KILL. The patch doesn't add the new PTRACE_ option to not complicate the API, and I _hope_ this won't cause any noticeable regression: - If debugger uses PTRACE_O_TRACEEXEC and the thread did an exec and the tracer does a ptrace request without having consumed the exec event, it's 100% sure that the thread the ptracer thinks it is targeting does not exist anymore, or isn't the same as the one it thinks it is targeting. - To some degree this patch adds nothing new. In the scenario above ptrace(L) can fail with -ESRCH if it is called after the execing sub-thread wakes the leader up and before it "steals" the leader's pid. Test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <errno.h> #include <pthread.h> #include <assert.h> void tf(void arg) { execve("/usr/bin/true", NULL, NULL); assert(0); return NULL; } int main(void) { int leader = fork(); if (!leader) { kill(getpid(), SIGSTOP); pthread_t th; pthread_create(&th, NULL, tf, NULL); for (;;) pause(); return 0; } waitpid(leader, NULL, WSTOPPED); ptrace(PTRACE_SEIZE, leader, 0, PTRACE_O_TRACECLONE \| PTRACE_O_TRACEEXEC); waitpid(leader, NULL, 0); ptrace(PTRACE_CONT, leader, 0,0); waitpid(leader, NULL, 0); int status, thread = waitpid(-1, &status, 0); assert(thread > 0 && thread != leader); assert(status == 0x80137f); ptrace(PTRACE_CONT, thread, 0,0); /* * waitid() because waitpid(leader, &status, WNOWAIT) does not * report status. Why ???? * * Why WEXITED? because we have another kernel problem connected * to mt-exec. / siginfo_t info; assert(waitid(P_PID, leader, &info, WSTOPPED\|WEXITED\|WNOWAIT) == 0); assert(info.si_pid == leader && info.si_status == 0x0405); / OK, it sleeps in ptrace(PTRACE_EVENT_EXEC == 0x04) */ assert(ptrace(PTRACE_CONT, leader, 0,0) == -1); assert(errno == ESRCH); assert(leader == waitpid(leader, &status, WNOHANG)); assert(status == 0x04057f); assert(ptrace(PTRACE_CONT, leader, 0,0) == 0); return 0; } Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Simon Marchi <simon.marchi@efficios.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pedro Alves <palves@redhat.com> Acked-by: Simon Marchi <simon.marchi@efficios.com> Acked-by: Jan Kratochvil <jan.kratochvil@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-26 12:05:15 +02:00
Andrey Zhizhikin	2cbb55e591	This is the 5.4.120 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCkyEcACgkQONu9yGCS aT70Qg//Rv09McvLQ+8E0OilJ7TdT0UthXQFP+uPTu+/HPeHQkCO168cn1hbwD9K i0YfFYB7PqPe/wccHNsmWHSUYCzA9NnwExA84/jofjswkEMMc95x/bow5/xmLe/5 ImkjODPVHuQWewgMfbSmNu7Br4wmQC5U/K4r7hp/Aa0FdTjcHMI6Zw40FGbJrWmq kiqhW9CeagKbxWrihQNLrSB4E5CdpNNkug/zVus2n9jlFT4tltNGSd7bPsxrp7LN EdTfayyPUVZeCoysTNA0WZgz47f+z47vAdIlDHzWCIOZcM1RnJXKA5kFXRf8Fnfa +hyvaHSDqYGdRgZxYMcXLL+/cS4foQ/8iQxZBCMomABM0MNUuoJ5tYR6GVetlRcR 46ZC/5OAvNoKY2Kj4Ky4ROF7aMR3NkYCY6wHUVRcw8778bmuReeLJJPsWojAI+4F pWT08+7OUJYb3hRnGxxzKot6CPztdkpQXfXMy+wyNlNbRZ/ivs9/f/GhdblXy/6T j12LKIh1IOxpB/wi7GRfeABUuC4MU8xqx6FuPDrBgCTMfVig/wcwF27AUr//a0F5 xrrzCrDFNAvuyD1WyYilaxWDHAe2o9ROT0JZ4VB3zu40w2VlTT77aqA174xfQa6b 418Eykw3O11dmsY8AQPTt1HhkDCiewEe4K58CJcmCNEf/inFbvI= =kNQc -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCk3VMACgkQ7G51OISz Hs0Y+A/9HqeZOFBPE3BlnEKkdxEPp+zoOj/57pWlS+Xr8ySalZXonAAC975voKjS 6IFejLkN83q5yCrwYBJ/PCc5Xm9deBf+6AE3KyaNV7CD5ycSBy8o9Fx7IH81wFXt quiVNlpi7iGiWsm3ubsvDHNgHwQlvN6mbNp+RIqD4YllheUsTXeR05rrfr+TauXv Bv7LEolcyUJo0LTzqPsTlVFG9ec+0qWZb1A7LgOZJFWrS/XaYpoXA49KaNbIZIPj jb+sWzTsXhOMc2yB1A6P8T3jsx7NxK4Y2rRyCk0l5Hm//Jl7Kcx+vbe0yQPf2DOs c0X+0s8Vwk/Ry606XLcy0nIIGePyj++u43r6noB/cN/LnZUbJY3zbywUNvrY1thS YG2MtSiV9TzzKP4xApPF7G/4G6H4HOaC1cQbxc3+7sklJf7Coh3pKv775POIfnLS UrQ2ToQRNWQC4/l/O9h9BYiIPWSAAUd0uBgZzcdQWeWbdPXMdb02cwZ5YuPoZgSV aLsDN8naaQlsZ20wdQ0DaqKoIryHnD0XHNSMkEvZ0bjPxOOtICekixLJQxjwFBQL cx3stcQs5PT8l3PQqc2xnVNTqnL7yWGdMYdfNE+4qTLSiw2s5uXbb7BJwttj9Pit QKVSVUtp12ObX1kJzqhKa9GQEUPYREVBpN7NwJL2E5hVObfT6OE= =ruSa -----END PGP SIGNATURE----- Merge tag 'v5.4.120' into 5.4-2.3.x-imx This is the 5.4.120 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-19 09:41:35 +00:00
Jia-Ju Bai	a8cfa7aff1	kernel: kexec_file: fix error return code of kexec_calculate_store_digests() [ Upstream commit 31d82c2c787d5cf65fedd35ebbc0c1bd95c1a679 ] When vzalloc() returns NULL to sha_regions, no error return code of kexec_calculate_store_digests() is assigned. To fix this bug, ret is assigned with -ENOMEM in this case. Link: https://lkml.kernel.org/r/20210309083904.24321-1-baijiaju1990@gmail.com Fixes: `a43cac0d9d` ("kexec: split kexec_file syscall code to kexec_file.c") Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-19 10:08:28 +02:00
Odin Ugedal	043ebbccdd	sched/fair: Fix unfairness caused by missing load decay [ Upstream commit 0258bdfaff5bd13c4d2383150b7097aecd6b6d82 ] This fixes an issue where old load on a cfs_rq is not properly decayed, resulting in strange behavior where fairness can decrease drastically. Real workloads with equally weighted control groups have ended up getting a respective 99% and 1%(!!) of cpu time. When an idle task is attached to a cfs_rq by attaching a pid to a cgroup, the old load of the task is attached to the new cfs_rq and sched_entity by attach_entity_cfs_rq. If the task is then moved to another cpu (and therefore cfs_rq) before being enqueued/woken up, the load will be moved to cfs_rq->removed from the sched_entity. Such a move will happen when enforcing a cpuset on the task (eg. via a cgroup) that force it to move. The load will however not be removed from the task_group itself, making it look like there is a constant load on that cfs_rq. This causes the vruntime of tasks on other sibling cfs_rq's to increase faster than they are supposed to; causing severe fairness issues. If no other task is started on the given cfs_rq, and due to the cpuset it would not happen, this load would never be properly unloaded. With this patch the load will be properly removed inside update_blocked_averages. This also applies to tasks moved to the fair scheduling class and moved to another cpu, and this path will also fix that. For fork, the entity is queued right away, so this problem does not affect that. This applies to cases where the new process is the first in the cfs_rq, issue introduced `3d30544f02` ("sched/fair: Apply more PELT fixes"), and when there has previously been load on the cgroup but the cgroup was removed from the leaflist due to having null PELT load, indroduced in `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path"). For a simple cgroup hierarchy (as seen below) with two equally weighted groups, that in theory should get 50/50 of cpu time each, it often leads to a load of 60/40 or 70/30. parent/ cg-1/ cpu.weight: 100 cpuset.cpus: 1 cg-2/ cpu.weight: 100 cpuset.cpus: 1 If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2 equally weighted, they should still get a 50/50 balance of cpu time. This however sometimes results in a balance of 10/90 or 1/99(!!) between the task groups. $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1 root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1 parent/ cg-1/ cpu.weight: 100 sub-group/ cpu.weight: 1 cpuset.cpus: 1 cg-2/ cpu.weight: 100 sub-group/ cpu.weight: 10000 cpuset.cpus: 1 This can be reproduced by attaching an idle process to a cgroup and moving it to a given cpuset before it wakes up. The issue is evident in many (if not most) container runtimes, and has been reproduced with both crun and runc (and therefore docker and all its "derivatives"), and with both cgroup v1 and v2. Fixes: `3d30544f02` ("sched/fair: Apply more PELT fixes") Fixes: `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-19 10:08:28 +02:00
Quentin Perret	687f523c13	sched: Fix out-of-bound access in uclamp [ Upstream commit 6d2f8909a5fabb73fe2a63918117943986c39b6c ] Util-clamp places tasks in different buckets based on their clamp values for performance reasons. However, the size of buckets is currently computed using a rounding division, which can lead to an off-by-one error in some configurations. For instance, with 20 buckets, the bucket size will be 1024/20=51. A task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly, correct indexes are in range [0,19], hence leading to an out of bound memory access. Clamp the bucket id to fix the issue. Fixes: `69842cba9a` ("sched/uclamp: Add CPU's clamp buckets refcounting") Suggested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20210430151412.160913-1-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-19 10:08:28 +02:00
Andrey Zhizhikin	6602fc5788	This is the 5.4.119 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCeKsMACgkQONu9yGCS aT4cYhAA0qDTHscvm641m/Dv4U9w3gWh2Fs8oPz43+nJ1/8CTrT/gSWA7IRDDHiV Dys2canDVLNYTEx1TqwmHbN3R+nvQTpdz2wuJuSf7GKYQj0n3S99BEN6uxod+puu /M7apBH5npjZKv1DMRUrQ/AUGVUuBQqtN7Hl5hEL8ibI/bsZV8+dhJJ8c8uyJpam peiP5n2lCz5HZ/K5OyEy1jCmWQLIcRN59SmiARy/xk739igoCMUajkY1mV0WVyks SKnZEP7tY1mLLYzpW/ZVSkXurx+ZtL1zUctRt5dh5US4uzNt/sfm8oDzyzvGojd/ iWtXefprJXbI9BGyNaBwwNmzjSXabSkoI75wExxsMQKFZpsq12pz97dwy/pZyU+c NlzbmDQg8+Cs9dKsDw6jUXHYSJ9fb4mk6GOF9u0LXgyq1f15/DzjdzkYLXZ3tTOK exVFs/CKz7Dg6npdO5kl7mg18AxmVH+OJftltF2+MbUohBs2vRDRr+O4cY8Wlc2Q AF85uAE3Mo/yL1pi6O7lMW4ic5yJvTRCX/iPsxyDU8LvxM1Kc7u9CzykX3M0WFLz TsKxfPQvoc6WGf8IWy4j1nXMzXQTHL/6CrfzOSTFngR8eqcsbgU0nkKpZNEtvnxN k30ID+Mcl4B6k6XTECNJUXjcwg+TR+XtKOjwXAIDaVqwoW759BY= =25pJ -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCe4QMACgkQ7G51OISz Hs1JdQ//aC82eBW+TH85PNlVaYn7DL8cU6zE/mX4dofughg8AIg84v0v30rRGF3m c0bxBhwcah+4TIPzWSs8QAT7pEcvR3Gj5JZrEeIuyXBwKhRSXgZJKbV4XAE9Z23W 1Etxwh28NfLbLEnX6qJnejQhHckt4MK9j4W5Gg2L7o6TrZwB7gLjNPfck6t7DnJc 9Fo4TxwgP+BFu30s9h9eQqv/EIVJPRfiAVgvK4o1CCEHya8bfnJgJDgyxKbdyBg6 IXYfYWGVe26VCR7PreEFJ3iheVVlI9VicFfhRzbsCdYQYYNqa22VfyMFrOO1nyMO SPAQnDRqlpvXIUHZ70puDX9QYji2RhVjf6fnDoQRPlkDaKumO3+zdk8bcCJFk7kF f7BPc8babvAn4WeG2hQOlPSKf+Mcc5mFmh08bsb8y9/w2mNPhg8UkCsaymggabkP GK/MvaBGxSoV8ft+EInyRLN3qlLrinYZDOOTj40Pd/d3HrJQCcShXzOI4SRd73MU +ISKHU9BrsRvVQ5ICjVTRfFMcFDJUt60leVM+QGm6MxQmgAuW/Scy9bNb9sDUyJd 6TUI2stzAxDRrjz8JDIXFyUImC3juL1c4WnrpP5I6ZWIeXIpVlQAH1Wxg18Dj0DC 1W5Fb2cicRRLfSPBgjppeipChm9ecT6Fl+3n7kBo5tvSdiiJEo4= =Wb4x -----END PGP SIGNATURE----- Merge tag 'v5.4.119' into 5.4-2.3.x-imx This is the 5.4.119 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-14 20:43:43 +00:00
Arnd Bergmann	32e046965f	smp: Fix smp_call_function_single_async prototype commit 1139aeb1c521eb4a050920ce6c64c36c4f2a3ab7 upstream. As of commit `966a967116` ("smp: Avoid using two cache lines for struct call_single_data"), the smp code prefers 32-byte aligned call_single_data objects for performance reasons, but the block layer includes an instance of this structure in the main 'struct request' that is more senstive to size than to performance here, see `4ccafe0320` ("block: unalign call_single_data in struct request"). The result is a violation of the calling conventions that clang correctly points out: block/blk-mq.c:630:39: warning: passing 8-byte aligned argument to 32-byte aligned parameter 2 of 'smp_call_function_single_async' may result in an unaligned pointer access [-Walign-mismatch] smp_call_function_single_async(cpu, &rq->csd); It does seem that the usage of the call_single_data without cache line alignment should still be allowed by the smp code, so just change the function prototype so it accepts both, but leave the default alignment unchanged for the other users. This seems better to me than adding a local hack to shut up an otherwise correct warning in the caller. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/20210505211300.3174456-1-arnd@kernel.org [nc: Fix conflicts] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-14 09:44:33 +02:00
Waiman Long	279749d0d4	sched/debug: Fix cgroup_path[] serialization [ Upstream commit ad789f84c9a145f8a18744c0387cec22ec51651e ] The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other mean). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for the quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actually serial console implementation of the system. The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack. The solution implemented in this patch is to allow only one caller at a time to use the full size group_path[], while other simultaneous callers will have to use shorter stack buffers with the possibility of path name truncation. A "..." suffix will be printed if truncation may have happened. The cgroup path name is provided for informational purpose only, so occasional path name truncation should not be a big problem. Fixes: `efe25c2c7b` ("sched: Reinstate group names in /proc/sched_debug") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-14 09:44:26 +02:00
Andrey Zhizhikin	ba4e63325c	This is the 5.4.118 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCacuMACgkQONu9yGCS aT7jXQ/+N95y28rkW+9aG33bMKwodiGO3pax1ZT59SwVICDQQQhK6zXmsVtWP3hv oaDqbfN+ap/Ms0dARSxhq4NxtGc1RX8Jv+0XJ0nJ10JkJqAizNwglhtfA4NDAeB1 w0M4b6vYYpotjReo86ZB8SC870eKUIocJKiayksIvgOTJewvq+4qDqn3h6VKdV3s p9Gxjz/8l2koGfUix+lPvPRx2c7juw49Nje0fWQzfHYUwtOYn8s7e6NZxtIJtYtq F80lqdXjGAXkUCf1omW+6TifSUPfmx1aPgOPBiP8WBlNwJ8hvsq6s+2MGdC+0PkZ 4UPTllSe/Q2g1xbO67yFHNYFYE4PKojZ8NKvJXcp5nvBDNpbiefaRROM7PbkQQmm p1Bayy39Hlsmxb6/d/9HOANOZZeCaF1PchaLviwfkrq64U/Yg2csFHl/uX71fJoT RchzeLRWPCqN91Bm5tgUeBGibqNsfkZNzfbiOEGN7MzZNsU3BZm0KbKpqnXzSvgG 6guZD1m4cjmyT7BzRsSremecIn9n8TmxT/lutAGtUi8TWodWBc3kvtxe3/xBILQ1 MOWhBIhO9/2HAjJ+h/GIFGOrwhGtFmA5x1gGXOSE+Kkxx1jUiPE9zvPFQrgYrdAQ yL25fPyfNO5MTUC2rEF7s0hW5dWbcL7H8r8ZbXSh2oaUokn+a00= =FHFi -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCaq/0ACgkQ7G51OISz Hs10mA//ci+tIUNzwXtWW5LAW5gpapDK3vVLwT4e1hrjeHFlHhIAVfRX2CcpLSoj dgpQOFKFKJ3q30GzkJSG0eniurqYNBz2T7N0bozPcKIvvbUVF7JnZECfizC24+kd GcjSUx15vwnHhYbg7cwXb8kkVjzjL/pfGSdlKsUZ5q2dcbIxWH8XVt/DKGaQyq+G zRH46Wc4TLQim4W5nsiNiSZw9L2kj+GTrFFJttG/+/K7OMMXrjiRdpUM90ufUqJm OPRn52myz7R9+SygQ2++5pSXIws2vAp6xd6CSQ+uXF7kxiDGCs+W/+v6TDr/8a4K haONfcWqcnc4IXLkr7m6nCQCqowBRzK6gXYlocWbsfJDBoQw8dzQQbe2EY8wWxwe qjLA9/Y/cR9YKinSdilex9Wixt1S0tsWZVWY4wvsztH24nM1l/jLAxw8ebaoTxv6 5thEsBmZCCJLP7gvqPh0bXyIGki03yZGpa/7D7cU8zNLQTOpY3rHfWzK518mRzrG LDIM9YfW3k/4evN8t5pJEbO+R9oeWyvQV3Ka48fttOYlyy5AMqxlZSEFwypjTz2N hE0x1gBV2l9k6kxErkQs/IWNeOO4xr53TNgepcMis0ADsTLv3J0kGv0tC4Ifvdwf Sl3+TXovEeCcUYSCFCUviPpdSziGQh+w9XxYeDfC7zbNKVHQoqk= =zVUy -----END PGP SIGNATURE----- Merge tag 'v5.4.118' into 5.4-2.3.x-imx This is the 5.4.118 stable release Conflicts (manual resolve): - drivers/mmc/core/core.c: - drivers/mmc/core/host.c: Fix merge fuzz for upstream commit `909a01b951` ("mmc: core: Fix hanging on I/O during system suspend for removable cards") Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-11 16:08:17 +00:00
Steven Rostedt (VMware)	c64da3294a	tracing: Restructure trace_clock_global() to never block commit aafe104aa9096827a429bc1358f8260ee565b7cc upstream. It was reported that a fix to the ring buffer recursion detection would cause a hung machine when performing suspend / resume testing. The following backtrace was extracted from debugging that case: Call Trace: trace_clock_global+0x91/0xa0 __rb_reserve_next+0x237/0x460 ring_buffer_lock_reserve+0x12a/0x3f0 trace_buffer_lock_reserve+0x10/0x50 __trace_graph_return+0x1f/0x80 trace_graph_return+0xb7/0xf0 ? trace_clock_global+0x91/0xa0 ftrace_return_to_handler+0x8b/0xf0 ? pv_hash+0xa0/0xa0 return_to_handler+0x15/0x30 ? ftrace_graph_caller+0xa0/0xa0 ? trace_clock_global+0x91/0xa0 ? __rb_reserve_next+0x237/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 ? trace_event_buffer_lock_reserve+0x3c/0x120 ? trace_event_buffer_reserve+0x6b/0xc0 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0 ? dpm_run_callback+0x3b/0xc0 ? pm_ops_is_empty+0x50/0x50 ? platform_get_irq_byname_optional+0x90/0x90 ? trace_device_pm_callback_start+0x82/0xd0 ? dpm_run_callback+0x49/0xc0 With the following RIP: RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200 Since the fix to the recursion detection would allow a single recursion to happen while tracing, this lead to the trace_clock_global() taking a spin lock and then trying to take it again: ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* lock taken / (something else gets traced by function graph tracer) ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { / DEAD LOCK! / Tracing should never* block, as it can lead to strange lockups like the above. Restructure the trace_clock_global() code to instead of simply taking a lock to update the recorded "prev_time" simply use it, as two events happening on two different CPUs that calls this at the same time, really doesn't matter which one goes first. Use a trylock to grab the lock for updating the prev_time, and if it fails, simply try again the next time. If it failed to be taken, that means something else is already updating it. Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home Cc: stable@vger.kernel.org Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Fixes: b02414c8f045 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem Fixes: `14131f2f98` ("tracing: implement trace_clock_*() APIs") # where the bug happened Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761 Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:18 +02:00
Steven Rostedt (VMware)	0834094c9a	tracing: Map all PIDs to command lines commit 785e3c0a3a870e72dc530856136ab4c8dd207128 upstream. The default max PID is set by PID_MAX_DEFAULT, and the tracing infrastructure uses this number to map PIDs to the comm names of the tasks, such output of the trace can show names from the recorded PIDs in the ring buffer. This mapping is also exported to user space via the "saved_cmdlines" file in the tracefs directory. But currently the mapping expects the PIDs to be less than PID_MAX_DEFAULT, which is the default maximum and not the real maximum. Recently, systemd will increases the maximum value of a PID on the system, and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its comm is not recorded. This leads to the entire trace to have "<...>" as the comm name, which is pretty useless. Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead of just mapping the index to the comm, map a mask of the PID (PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the map_cmdline_to_pid array (that already exists). This bug goes back to the beginning of ftrace, but hasn't been an issue until user space started increasing the maximum value of PIDs. Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home Cc: stable@vger.kernel.org Fixes: `bc0c38d139` ("ftrace: latency tracer infrastructure") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:18 +02:00
Masahiro Yamada	79c95130a5	kbuild: update config_data.gz only when the content of .config is changed commit 46b41d5dd8019b264717978c39c43313a524d033 upstream. If the timestamp of the .config file is updated, config_data.gz is regenerated, then vmlinux is re-linked. This occurs even if the content of the .config has not changed at all. This issue was mitigated by commit `67424f61f8` ("kconfig: do not write .config if the content is the same"); Kconfig does not update the .config when it ends up with the identical configuration. The issue is remaining when the .config is created by _defconfig with some config fragment(s) applied on top. This is typical for powerpc and mips, where several _defconfig targets are constructed by using merge_config.sh. One workaround is to have the copy of the .config. The filechk rule updates the copy, kernel/config_data, by checking the content instead of the timestamp. With this commit, the second run with the same configuration avoids the needless rebuilds. $ make ARCH=mips defconfig all [ snip ] $ make ARCH=mips defconfig all *** Default configuration is based on target '32r2el_defconfig' Using ./arch/mips/configs/generic_defconfig as base Merging arch/mips/configs/generic/32r2.config Merging arch/mips/configs/generic/el.config Merging ./arch/mips/configs/generic/board-boston.config Merging ./arch/mips/configs/generic/board-ni169445.config Merging ./arch/mips/configs/generic/board-ocelot.config Merging ./arch/mips/configs/generic/board-ranchu.config Merging ./arch/mips/configs/generic/board-sead-3.config Merging ./arch/mips/configs/generic/board-xilfpga.config # # configuration written to .config # SYNC include/config/auto.conf CALL scripts/checksyscalls.sh CALL scripts/atomic/check-atomics.sh CHK include/generated/compile.h CHK include/generated/autoksyms.h Reported-by: Elliot Berman <eberman@codeaurora.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:16 +02:00
Thomas Gleixner	8d2be04dbb	Revert `337f13046f` ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op") commit 4fbf5d6837bf81fd7a27d771358f4ee6c4f243f8 upstream. The FUTEX_WAIT operand has historically a relative timeout which means that the clock id is irrelevant as relative timeouts on CLOCK_REALTIME are not subject to wall clock changes and therefore are mapped by the kernel to CLOCK_MONOTONIC for simplicity. If a caller would set FUTEX_CLOCK_REALTIME for FUTEX_WAIT the timeout is still treated relative vs. CLOCK_MONOTONIC and then the wait arms that timeout based on CLOCK_REALTIME which is broken and obviously has never been used or even tested. Reject any attempt to use FUTEX_CLOCK_REALTIME with FUTEX_WAIT again. The desired functionality can be achieved with FUTEX_WAIT_BITSET and a FUTEX_BITSET_MATCH_ANY argument. Fixes: `337f13046f` ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210422194704.834797921@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:16 +02:00
Lingutla Chandrasekhar	b56ad4febe	sched/fair: Ignore percpu threads for imbalance pulls [ Upstream commit 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2 ] During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [ ][ ] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-11 14:04:11 +02:00
Vitaly Kuznetsov	a629f6bc03	genirq/matrix: Prevent allocation counter corruption [ Upstream commit c93a5e20c3c2dabef8ea360a3d3f18c6f68233ab ] When irq_matrix_free() is called for an unallocated vector the managed_allocated and total_allocated counters get out of sync with the real state of the matrix. Later, when the last interrupt is freed, these counters will underflow resulting in UINTMAX because the counters are unsigned. While this is certainly a problem of the calling code, this can be catched in the allocator by checking the allocation bit for the to be freed vector which simplifies debugging. An example of the problem described above: https://lore.kernel.org/lkml/20210318192819.636943062@linutronix.de/ Add the missing sanity check and emit a warning when it triggers. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210319111823.1105248-1-vkuznets@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-11 14:04:05 +02:00
Chen Jun	1c5cb86cdd	posix-timers: Preserve return value in clock_adjtime32() commit 2d036dfa5f10df9782f5278fc591d79d283c1fad upstream. The return value on success (>= 0) is overwritten by the return value of put_old_timex32(). That works correct in the fault case, but is wrong for the success case where put_old_timex32() returns 0. Just check the return value of put_old_timex32() and return -EFAULT in case it is not zero. [ tglx: Massage changelog ] Fixes: `3a4d44b616` ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:04 +02:00
Christoph Hellwig	13b0a28e6f	modules: inherit TAINT_PROPRIETARY_MODULE commit 262e6ae7081df304fc625cf368d5c2cbba2bb991 upstream. If a TAINT_PROPRIETARY_MODULE exports symbol, inherit the taint flag for all modules importing these symbols, and don't allow loading symbols from TAINT_PROPRIETARY_MODULE modules if the module previously imported gplonly symbols. Add a anti-circumvention devices so people don't accidentally get themselves into trouble this way. Comment from Greg: "Ah, the proven-to-be-illegal "GPL Condom" defense :)" [jeyu: pr_info -> pr_err and pr_warn as per discussion] Link: http://lore.kernel.org/r/20200730162957.GA22469@lst.de Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:04 +02:00
Christoph Hellwig	cd5a738e28	modules: return licensing information from find_symbol commit ef1dac6021cc8ec5de02ce31722bf26ac4ed5523 upstream. Report the GPLONLY status through a new argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:04 +02:00
Christoph Hellwig	c4698910a9	modules: rename the licence field in struct symsearch to license commit cd8732cdcc37d7077c4fa2c966b748c0662b607e upstream. Use the same spelling variant as the rest of the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:04 +02:00
Christoph Hellwig	7500d49994	modules: unexport __module_address commit 34e64705ad415ed7a816e60ef62b42fe6d1729d9 upstream. __module_address is only used by built-in code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:03 +02:00
Christoph Hellwig	ad6d414703	modules: unexport __module_text_address commit 3fe1e56d0e68b623dd62d8d38265d2a052e7e185 upstream. __module_text_address is only used by built-in code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:03 +02:00
Christoph Hellwig	86de29b833	modules: mark each_symbol_section static commit a54e04914c211b5678602a46b3ede5d82ec1327d upstream. each_symbol_section is only used inside of module.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:03 +02:00
Christoph Hellwig	79100b191e	modules: mark find_symbol static commit 773110470e2fa3839523384ae014f8a723c4d178 upstream. find_symbol is only used in module.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:03 +02:00
Christoph Hellwig	6e38daf2e5	modules: mark ref_module static commit 7ef5264de773279b9f23b6cc8afb5addb30e970b upstream. ref_module isn't used anywhere outside of module.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:03 +02:00
Steven Rostedt (VMware)	0b641b2587	ftrace: Handle commands when closing set_ftrace_filter file commit 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 upstream. # echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter will cause switch_mm to stop tracing by the traceoff command. # echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter does nothing. The reason is that the parsing in the write function only processes commands if it finished parsing (there is white space written after the command). That's to handle: write(fd, "switch_mm:", 10); write(fd, "traceoff", 8); cases, where the command is broken over multiple writes. The problem is if the file descriptor is closed, then the write call is not processed, and the command needs to be processed in the release code. The release code can handle matching of functions, but does not handle commands. Cc: stable@vger.kernel.org Fixes: `eda1e32855` ("tracing: handle broken names in ftrace filter") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-11 14:04:01 +02:00
Andrey Zhizhikin	e43d547d72	This is the 5.4.117 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCVAOoACgkQONu9yGCS aT7hHg//RSTMezyVHL3nU1cpPxTXE030uYYO6E6TGNIEmEfRijsJhZMQW/RPuqeX 4x3Beq7GtwnoDJTxR1FbeF8hFLSzqcyV0780Vknuk0bogM5tPL3HweysAzISbLWm bSDd76hQPTwOXOsgw1VDMwp9YLPYsZ5Bsnh9WuoIXwvBNUCu4pTS1bbs0Dd/vgGg OOf3aXr89Zrz79dvDq9KoE9gg2WVKhOTbK4jghJP2bgHI7hv7Enk87DPZSyHErDP v9SqgSNnREALL87K9UDdosJ5xDQloV0oJ/2AYsETPDKuIRWHo32R3VSAd8XtPrAz YP/Fmg08Q28onsoT+YUps44t1275mtUNFKPd6NHdyqGn+UHjxMG2EdWOR9cvL4Nh RF9B1jb2nq4olgeyt19CqwrYxcxuJOTPn/rUJcCLdyARIfzXOnHCSKJCkk5FSAU9 daU3HQlv5ukdrW7bl9QAVT6a7dJ9wZd9AB4WFghM68QZxHzylseF3ELTDLVcDN8O avGGOU6dc6MmO+7H6dr2s3cV9zBXCUoUD5mAzVzgz6h44Wpn+LK+ks6IWMnTNO9j ER94kMysUEcItguOEpJ1oe0b3Is+rmZsIBuRzneNlo1aGvhd3VCb+Zl9Wd92HwKi lvOwuha9GrLDkmUIaEbLQq9wdl3d6gcA6ldsnghParnrNEv319c= =iO7e -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCVHn0ACgkQ7G51OISz Hs1SQw/7BpMqSym7Def8uljmntnEqzt3ioxakM0TE2G/F9WJZZ/ESWTE6K6SvjbH DX9wYvvDdEh4UdxprXEDReU9w0szmIqicwlVbmmhRH+e2HsAbtow9yf5YnFzSiDU 3lC8GW3Rn6acNkqq6IFNiEHd525FJD7YwnfTfvMUflLBzm7+GD5+RG/bp9/yktMl CkV4BTSYc95CYEE/S0H47UMp7d1tbSEEZ98B2lgKtPDfhDsjn03Q26c3HPs8c0bR 4cAIOPMRZnv79S+TP75GZhmfapjEyBO3ajzihEYP5uWWWSF4OwZJ1VpGhgusLZ4j W+3sOUkG8x9ppaMQtDrrroSK56H9z1nODil7uLLsGuvf7mcUqj/w1/aR/HfxLrro H59YU4CZAPRjPh3eTdDWvtlsdHRqMjO/oywBCW/01CE+h9k2UHUowQdIV/q+CT+0 nnkZRiAVn0Ex4m1YtGgv9sNjAvkMbSWSOAHVKQS55F1ZEhamIAtijLHB52DIGL8r SYmjT6A22u/spQEZircFtP3PvHSeuLNDx0MDNBkntQTLGZOLKCajWs4y5zsZSAEw /X/6P+f5Fc2lvtUy7+OkJv6l3flXu9IsvHgdaGDdMfiMyI2ui5xp2med/wrceX2o MbihC/LlhRhBAT35RCD7hOtxvXy2Eufn9h89EnjmPHACjo8aQSs= =A+pM -----END PGP SIGNATURE----- Merge tag 'v5.4.117' into 5.4-2.3.x-imx This is the 5.4.117 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-07 11:03:22 +00:00
Ondrej Mosnacek	b246759284	perf/core: Fix unconditional security_locked_down() call commit 08ef1af4de5fe7de9c6d69f1e22e51b66e385d9b upstream. Currently, the lockdown state is queried unconditionally, even though its result is used only if the PERF_SAMPLE_REGS_INTR bit is set in attr.sample_type. While that doesn't matter in case of the Lockdown LSM, it causes trouble with the SELinux's lockdown hook implementation. SELinux implements the locked_down hook with a check whether the current task's type has the corresponding "lockdown" class permission ("integrity" or "confidentiality") allowed in the policy. This means that calling the hook when the access control decision would be ignored generates a bogus permission check and audit record. Fix this by checking sample_type first and only calling the hook when its result would be honored. Fixes: `b0c8fdc7fd` ("lockdown: Lock down perf when in confidentiality mode") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul Moore <paul@paul-moore.com> Link: https://lkml.kernel.org/r/20210224215628.192519-1-omosnace@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-07 10:51:38 +02:00
Daniel Borkmann	8ba25a9ef9	bpf: Fix leakage of uninitialized bpf stack under speculation commit 801c6058d14a82179a7ee17a4b532cac6fad067f upstream. The current implemented mechanisms to mitigate data disclosure under speculation mainly address stack and map value oob access from the speculative domain. However, Piotr discovered that uninitialized BPF stack is not protected yet, and thus old data from the kernel stack, potentially including addresses of kernel structures, could still be extracted from that 512 bytes large window. The BPF stack is special compared to map values since it's not zero initialized for every program invocation, whereas map values /are/ zero initialized upon their initial allocation and thus cannot leak any prior data in either domain. In the non-speculative domain, the verifier ensures that every stack slot read must have a prior stack slot write by the BPF program to avoid such data leaking issue. However, this is not enough: for example, when the pointer arithmetic operation moves the stack pointer from the last valid stack offset to the first valid offset, the sanitation logic allows for any intermediate offsets during speculative execution, which could then be used to extract any restricted stack content via side-channel. Given for unprivileged stack pointer arithmetic the use of unknown but bounded scalars is generally forbidden, we can simply turn the register-based arithmetic operation into an immediate-based arithmetic operation without the need for masking. This also gives the benefit of reducing the needed instructions for the operation. Given after the work in 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask"), the aux->alu_limit already holds the final immediate value for the offset register with the known scalar. Thus, a simple mov of the immediate to AX register with using AX as the source for the original instruction is sufficient and possible now in this case. Reported-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-07 10:51:37 +02:00
Daniel Borkmann	53e0db429b	bpf: Fix masking negation logic upon negative dst register commit b9b34ddbe2076ade359cd5ce7537d5ed019e9807 upstream. The negation logic for the case where the off_reg is sitting in the dst register is not correct given then we cannot just invert the add to a sub or vice versa. As a fix, perform the final bitwise and-op unconditionally into AX from the off_reg, then move the pointer from the src to dst and finally use AX as the source for the original pointer arithmetic operation such that the inversion yields a correct result. The single non-AX mov in between is possible given constant blinding is retaining it as it's not an immediate based operation. Fixes: `979d63d50c` ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-07 10:51:37 +02:00
Andrey Zhizhikin	2f748dbefd	This is the 5.4.116 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCOa1cACgkQONu9yGCS aT41hA/9GEiGllP3kafPxmS5hzcfyywi7Asuy//EpxSHiXoNk97dTrx/aUDYidyC 6PD15GjtUJMCaGZgYtIHhcus0NV/yv55mHOBrH+QT5tDjhd8LiylE6H9S5usAG7Q j99AtVi9zQeEr1H+3yG7RBiffgLt3Qpu3bS/RPhatBECZnciIBe6RFCKyQbJqjmH lVQHh8XCzfR7TeFf2T8mHKU1AEopeXGXtZ9BvjiTlDG3CJ85Gx3sfnLjkSWUQ9ho 1WSrkX/RCO8SuL69cNoT0ydyaCH5OxHQRfIavwq4CMUOc93kOIkbZn8jsmwNCoua gS3spldh3XawQWt87Nurvth587vHakhUENp6kJD3/N6i/gkDc6JmKwo3apPel8Pm Via010r3ofFI/I4YwcowinXW5Sa0gHMn/8PNtZ2XHfAEv7v3QI4WQMdRI9/3+z6v onsmO7Rrc1LFajcOE6oqnzxJhuqBq+TgyvP/Pmp8Q+NCw2tLeUxTYrxzIV0bU9k1 JoBZeJznEEyxOab2UFiE0qP7dEwyweB0pXzYqoZ0XlSHlTkbo0BgNRpF4majGkTK CVtkYHyrHnmnQiu8mbzFf18KS6t07pDE/CtpGul9E9JVt4MBVE3h+TPbNM9QJQfQ iq9xMVeekjKiZxaiFTmhgPB30jCGqvIkmCheFy/2/4Jn9OE0QT4= =nxmW -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCOu/QACgkQ7G51OISz Hs2YcA//b3ukesOyPt1Yr8XtC5TLVatmtStKW7Zmjdcq2LpUX9cvveydQYcFa6id qV0QKwFU87GR19LTxz+w8GXms/K/F9HTNQRY+idMePRz/MQCP8QiaIqw6nLq9vfj BMAA+V5mbvdN49XS6JcP8u7cH9IMV2hy5KZvJ1deicu7OYGrPHYzz6SMElFAgsfw Af0eVXIU3us6N2gS3CJSYLhLXm+R0zIpr/9AU2U/op6dt234tjlU0IcbsI3L+Jpe znqw5JCM9nN/W0tjFIWFKv0v98f5dhT/vpOOheys2eospffhVSeyWU8ZbJYdvrL7 EnxxdWGGU7fo123+aXoTSjOonMCqX6OGg1Pjl0HXH1ip3wpdYpnhOL2U3pdg2Qtn RUOsqIcohWelpjrpeIeiNpvtONmQDJTGwHS9nkFMYGmROtWOYtdw5cBOE/7e7qYt xD9ecLSDnFS6/OffFNKktt8nxqXy0G0NAsm/Pkkb6fsIdM7WcUAwiiIl/fyCyMR+ VdvU20Du0Z+kW3aX4ILLqDypooyif0L4HJvxGWnvHgTv5ZVdKIBa7zqfWVChUdhy AWAOdfWxkn2zsSfR0mXag5QXVqVHGM3ctX0iVJvXoevn/P0VAV48BLD2JT4irKZI s8+GrY05azfS6nRSnsUa/bJwh300r/pRJGuWpzLdaiWEAFz0lNI= =5CVD -----END PGP SIGNATURE----- Merge tag 'v5.4.116' into 5.4-2.3.x-imx This is the 5.4.116 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-05-02 14:49:22 +00:00
Daniel Borkmann	ef4e68f0af	bpf: Tighten speculative pointer arithmetic mask commit 7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0 upstream. This work tightens the offset mask we use for unprivileged pointer arithmetic in order to mitigate a corner case reported by Piotr and Benedict where in the speculative domain it is possible to advance, for example, the map value pointer by up to value_size-1 out-of-bounds in order to leak kernel memory via side-channel to user space. Before this change, the computed ptr_limit for retrieve_ptr_limit() helper represents largest valid distance when moving pointer to the right or left which is then fed as aux->alu_limit to generate masking instructions against the offset register. After the change, the derived aux->alu_limit represents the largest potential value of the offset register which we mask against which is just a narrower subset of the former limit. For minimal complexity, we call sanitize_ptr_alu() from 2 observation points in adjust_ptr_min_max_vals(), that is, before and after the simulated alu operation. In the first step, we retieve the alu_state and alu_limit before the operation as well as we branch-off a verifier path and push it to the verification stack as we did before which checks the dst_reg under truncation, in other words, when the speculative domain would attempt to move the pointer out-of-bounds. In the second step, we retrieve the new alu_limit and calculate the absolute distance between both. Moreover, we commit the alu_state and final alu_limit via update_alu_sanitation_state() to the env's instruction aux data, and bail out from there if there is a mismatch due to coming from different verification paths with different states. Reported-by: Piotr Krysiuk <piotras@gmail.com> Reported-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Benedict Schlueter <benedict.schlueter@rub.de> [fllinden@amazon.com: backported to 5.4] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:04 +02:00
Daniel Borkmann	4dc6e55e28	bpf: Move sanitize_val_alu out of op switch commit f528819334881fd622fdadeddb3f7edaed8b7c9b upstream. Add a small sanitize_needed() helper function and move sanitize_val_alu() out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu() as well out of its opcode switch so this helps to streamline both. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [fllinden@amazon.com: backported to 5.4] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:04 +02:00
Daniel Borkmann	876d1cec93	bpf: Refactor and streamline bounds check into helper commit 073815b756c51ba9d8384d924c5d1c03ca3d1ae4 upstream. Move the bounds check in adjust_ptr_min_max_vals() into a small helper named sanitize_check_bounds() in order to simplify the former a bit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [fllinden@amazon.com: backport to 5.4] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:04 +02:00
Daniel Borkmann	4158e5fea3	bpf: Improve verifier error messages for users commit a6aaece00a57fa6f22575364b3903dfbccf5345d upstream. Consolidate all error handling and provide more user-friendly error messages from sanitize_ptr_alu() and sanitize_val_alu(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [fllinden@amazon.com: backport to 5.4] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:03 +02:00
Daniel Borkmann	15de0c537b	bpf: Rework ptr_limit into alu_limit and add common error path commit b658bbb844e28f1862867f37e8ca11a8e2aa94a3 upstream. Small refactor with no semantic changes in order to consolidate the max ptr_limit boundary check. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:03 +02:00
Daniel Borkmann	f7fbedc909	bpf: Ensure off_reg has no mixed signed bounds for all types commit 24c109bb1537c12c02aeed2d51a347b4d6a9b76e upstream. The mixed signed bounds check really belongs into retrieve_ptr_limit() instead of outside of it in adjust_ptr_min_max_vals(). The reason is that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer types that we handle in retrieve_ptr_limit() and given errors from the latter propagate back to adjust_ptr_min_max_vals() and lead to rejection of the program, it's a better place to reside to avoid anything slipping through for future types. The reason why we must reject such off_reg is that we otherwise would not be able to derive a mask, see details in `9d7eceede7` ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [fllinden@amazon.com: backport to 5.4] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:03 +02:00
Daniel Borkmann	4a163b1c70	bpf: Move off_reg into sanitize_ptr_alu commit 6f55b2f2a1178856c19bbce2f71449926e731914 upstream. Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can use off_reg for generalizing some of the checks for all pointer types. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-05-02 11:05:03 +02:00
Andrey Zhizhikin	868d3c6345	This is the 5.4.115 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCJRSsACgkQONu9yGCS aT4oBw//cuCRj7W9NCuU/n582jrtMTJ6r/hbC3+CG8bYxWyA9e8WrqVEuasHiHzQ +TNC86rdtzch+Ws9txww2Q/Cg316M9GfXonL+PGx4xtPr2UG0fs0pdnpaTqESd5V QYz3hYtfMDvavuIGpLeYqRvUmxXQDVWeTNbpmmbSs0qmICUgQZ4sxGBVdhTnkQ3U 8C1U0K5rkHztWLA/VJR5SrcFzNEXJUECq6LUPBSNBe90HCLBiP9kE98FUG1vVYdM 2fJ68v01p9D2lYUL9DY43ncpk1+2r1fzj3NNhhlKmjgbmvpzLHbdFt/W5BpcZvYP v4Tx0x7n0f1E7KW7/42DeMYENTDMGgT+6NMLJUCR7KJ0V/sfaXSppm9uYnFGtX2v 21tu5vz2IO/+070IT5iMdp2AARPvt4KBb+Be4pm58fzgJf59QFFoYZdGiPxY2/ga AkfwGxOPXHLe9BhRDl9lZnOzjRKSy3eRK0SGj+ctn2p03O8DNn/1P4Hd4imbtRSx 4bS+oEn3NrUod2pdjK6ZPFKVguYrbqKirQHtIKKE2Fk8Gfcz5/6+31VshXqpnqek D6d7wRSGYhjcZ33/jyGLkUs0vh5zNPr/c0Dtx3yDYdGot07LxXS1iSoHcG+qYjsq iCFve2twzdvtLQb80eSb2HvdBxp8uxzyLz4IITM4KcrU39YAxbw= =/d/n -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCJX2EACgkQ7G51OISz Hs3maw/+JaewWgsS7a9dcUei0NLsHxJh+LdgBIZnoDtLYCmAjnhzRhH63AqzENUA sGhj6DlNoaMz9IK6bMby6bWyvRSXQOjfos/Lo6xXhoIlrPiSalG/akYp9TuIeR24 +j5nDFWo+T9wZlVtoLnkE9nfE7ubnPDzC/7fSPnlIGjtJLkBhKYL7iuxSeCsJY2n 2E3jj30ReJVab7zTtTHwQZZJGOrMJ8RGOjP+r9jGT2spyEGIjBUQ1Ly7i+a7TLG2 2IQ3dp/lwEsJgpMfJBYPEAYn8gn8704P73zh5sepqgv+EdOAWgTCWaC73F7fJ61s g7/VRK86A/B6XkXvOruKf9fqKbFEmp7zlIOW+yRLnool1wMa24jrOeWzbHbuMPyW miL6xuC5E3lJjnxcUebGWJpaOewCTmo2aDAnoJ6E4wVTinjPGaZSOL95aEIzRPhM NbYHJeCzIQt4bnGFJElTbSziJToAsVCy9tbrZ6hjS9Nxg/AkdjJSk+79ehqQpJwq DP4L9tVkzlcN5u3pesQs64wcN3GH6lYD++lD+t5dzR9R2Fn7RpNw8HJnOE86QvAj t5glqPiaErQlCcDXCQNGpqISfuXkU7RgeDEvN3iMsUH7kd/i8vLpLY3KOERXu+1N ItqqBIIVhbGDLZMcvFdYIGzJQOnINawC/Om8D3ct0qtC49n1Buw= =AoGP -----END PGP SIGNATURE----- Merge tag 'v5.4.115' into 5.4-2.3.x-imx This is the 5.4.115 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-04-28 13:13:02 +00:00
Ali Saidi	82808cc026	locking/qrwlock: Fix ordering in queued_write_lock_slowpath() [ Upstream commit 84a24bf8c52e66b7ac89ada5e3cfbe72d65c1896 ] While this code is executed with the wait_lock held, a reader can acquire the lock without holding wait_lock. The writer side loops checking the value with the atomic_cond_read_acquire(), but only truly acquires the lock when the compare-and-exchange is completed successfully which isn’t ordered. This exposes the window between the acquire and the cmpxchg to an A-B-A problem which allows reads following the lock acquisition to observe values speculatively before the write lock is truly acquired. We've seen a problem in epoll where the reader does a xchg while holding the read lock, but the writer can see a value change out from under it. Writer \| Reader -------------------------------------------------------------------------------- ep_scan_ready_list() \| \|- write_lock_irq() \| \|- queued_write_lock_slowpath() \| \|- atomic_cond_read_acquire() \| \| read_lock_irqsave(&ep->lock, flags); --> (observes value before unlock) \| chain_epi_lockless() \| \| epi->next = xchg(&ep->ovflist, epi); \| \| read_unlock_irqrestore(&ep->lock, flags); \| \| \| atomic_cmpxchg_relaxed() \| \|-- READ_ONCE(ep->ovflist); \| A core can order the read of the ovflist ahead of the atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire semantics addresses this issue at which point the atomic_cond_read can be switched to use relaxed semantics. Fixes: `b519b56e37` ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock") Signed-off-by: Ali Saidi <alisaidi@amazon.com> [peterz: use try_cmpxchg()] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steve Capper <steve.capper@arm.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-28 13:19:14 +02:00
Andrey Zhizhikin	6800343925	This is the 5.4.114 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmCABPYACgkQONu9yGCS aT610xAAuFVE0FEaisv42yiS/jNtZk8NpPuBSaB1vP9TOyn1PyrO4p2klUdrFrLX 2d7ssYKZimDS4HB0lmr6tPXjQCuI3E2qB3s9mJntEZuxjweLR22uLC7DtWo4VDYt 87oM+jaWMao+3YOXpvbd2S8tA/WkvaBbYmXAGsO2XFoyUPhzxBXi+Mzoj5WeGPtc bQd+Odt1n00HJSyuSlXaBeuwzVHLq43Kxm2kKt7lBH3W1IKElRnw84XJAHAylDZ4 EwIkgncGm7EN25Nk9EESC0cvCBM6PK61S7CggOtcvyrPGRBqFlmbDKFxLT2BxIdP MuyXLvHRm6/oQb1brvWdeHw0++KwJ884HJF2/bB9ZXU8wCR317BA3dYMdSMvv7V3 3zickdfoPW8c5H/8t+BobGJoHFQ895xrwxAcQGR8oBtfjqo4JGd+9QnoEdaXb/7o 0t36qJLFYVKfkaeTxOTyQeImw79KVv4T/hXlwPBkdBu/yC8kfVL6ckJ+MfL1LH4B BkDMT5K/Fyp3HIyNakiB9c0s9ZgdvCI0hpZvEXX69VmEVnoEokxcQXc0YHvqMpz6 neH8snkuWc2fQhsSa4hkhcmx50ohTd5MQPJ7Fp8sx4NrLJ1aJNvG1YhWqJCNkRBx TRWSB7ipf+mesMHn4RIXINHmgvxpHHP/B6O3DvtSJPjZBtpImWw= =Xqnx -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmCARfIACgkQ7G51OISz Hs3JRxAApNK3PDMJvZmIrzQe6lXEInjoClMG/vVB99ueHQkEQbv5aswyr56Y+2JY /2UDQFEmCKSAkm1giztq2IO2s6ddL/f9OSzF+cbC4fHefI2MF2bvFiNwkJtY7NMG x7x6TGn+1gCBwDFIv7vK8TpK9jT4mGX4vfitVqRPOhFcsX2R7l4wMwFJtGU8oNHh lZUzF4XTPLsyxvC2BSVYkMFh5w+a52dF3zhsqhEr+sQSDa5EM6+tzusj0NHSk2Gm AziZTapRUyzo2YWcg0EcJaGjYOc7Be6EJJOdPWeV3qJfEAN2eOyV0g6yijiB1uhi XZmoVY46jY8lUsqZQRdRkalv/aY6Qrso7TsI6AaL2LaaZ+PW+PemozriPpwQS9F5 OOrEIpfVQRha3H8PmAoKvwBjgbHbr5i+kVcUoSPwdpSMuPo8W3Hn6Ny1Cmcukrhk +onVYRz96BG35JRYYs+8WlJgPJsSBY+GHHhFTefIXmNaERdZMTTr3GLcD+6i7Qai 6YV/8nRMY2A6McbOafQ+dPvSgogpfYW9AEjgEqtCRllTo+eoEHjWWpxOzOA0pCq8 L7BXwReGT5mwsRnqI70eTGNusKlLH8WjfbDOrOx52b/Exzp1m2nsZfvMpoKo49xA gAeV24mGhXM9iLIRbT434/kF5dTcw0U7rgubY34BvIT3ilWVcDM= =2Kdu -----END PGP SIGNATURE----- Merge tag 'v5.4.114' into 5.4-2.3.x-imx This is the 5.4.114 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-04-21 15:34:08 +00:00
Tetsuo Handa	68bd0d8ab1	lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message [ Upstream commit 3a85969e9d912d5dd85362ee37b5f81266e00e77 ] Since this message is printed when dynamically allocated spinlocks (e.g. kzalloc()) are used without initialization (e.g. spin_lock_init()), suggest to developers to check whether initialization functions for objects were called, before making developers wonder what annotation is missing. [ mingo: Minor tweaks to the message. ] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210321064913.4619-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-21 12:56:13 +02:00
Andrey Zhizhikin	8ad3d2ca53	This is the 5.4.112 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmB2je0ACgkQONu9yGCS aT5LSQ//RbX6sC5N9hmM6XdixRqDXF0YZG6ADrZ24tEIUAvjXZa9rOFGlKyS2JAV 6KkqRfkrYK2lhyP0lGSkmWPQGoyocxV/6jLcA4XyTqetzxYRkYyW1jiEz7KCTp0+ AMwqazbMAlaTOTxbNk0TqTsLDrSAE1a5mX9XjPCqjFm1yVjc7gNxxXwKhX01u4LD bTw+vMaMtf9MW8sfV1vU9HOcH0BFwp9Sr0/AFb05u8F4BH9MS0XGa6c2bG1o1qQM bF7g1aZIcVgn0Jr8WrpsF/7tTUyy3l+XXBvyFNRYvqAnrdUrTDn2ItAPq3W5hqTu Y0fdcbAtmmnrHcDeGUD+kuaCTvQGSy+qgZAFvQRkzCmweyY+rvqLEJhO7sBpjqCv MszRkYvA0Ji4JaWUWxVlHbmbdIBQ8Jvo9ZMM7shAKq66a26De1W5CIJXTnZXJSij dALJowoEKJ2i7V63AoJSzEOlBDYoBUY8xbVzDEjdfBTbj2Gb+cVWRRTsGDKZeuqs 933fPTRMBOc2q36q6PVpUcpaRLktAFvc33FYdSK8M3/aN22ISQ1QbXqm47sXyQbk pHUqRFUJdvjVtQltYIiBQ/GgKY3+TQw9FtRjoSCuZuEeYjE8p004Wq/rWWIv+5mm jwY5gfsXKjQcP/Pcxl15kcmNQ4axkC/Jzln99xFScatXV6Ksqh0= =sCGS -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmB2vwwACgkQ7G51OISz Hs3sfg/+M5ZPNZXWbe44ku13A3pf9g/UToQsKI90CbBMlVeQctti3Fnn1ueO0qwB hQ0LFDekK5YRJwfQb3+5QbaQTacLiR4RzvW0L6WHzq9cpbFRbW6gEl4nMO15Mq5c o3ctpq3tO0DC7k84tXzV/wsHwV83FxmUC0BQCIzRsoxlTjB+BDf+Sy1giCHmJ9l1 G3sMzA1LmOSRY24q/l7hwTeVMr78S1uqP/xKq+TErRhuTtKAxINv2tiWTuMBsPLL g4OFzEvlX0OLxBihgBC6rSnAP61plq5qoaBLQjj7ex7MnAyBmNCFCZcPGIJgLMhj 5QrqEbMf8YFppXT9J0yUVxWLk35GmVdr1i5VpwyIVy0U8OrO72R8Bsj8zPSdAf5k yesJ4uINYcJfIRlmvc0vRcy55yMENbgxnwJBo7/Wcq8M1b/Ws96YFMF9QUQLOPBL nJtvYpuRk2Xx/GU15C0tn7f4nhVaP9z0PPZAx9GDYbTV5taYUWBDHK7cKV0rCgVj SVrndIS93kuBxqLdo1L+oPSOl0h4C70DG7zzb6Z5wU9CyvmxVClESXwj8p5AO+CK SQIvTe1G0QJnt+miLRpTpn4r4EfqUEbHkWOjIimaZ6fLdy7LDICyM6lFLhzjRp/i kal4DzCr1F54njzMseNI1D6C55unOynsN48ZzS4LPuvASgpLFzo= =ynil -----END PGP SIGNATURE----- Merge tag 'v5.4.112' into 5.4-2.3.x-imx This is the 5.4.112 stable release Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>	2021-04-14 10:08:08 +00:00
Zqiang	c88fa8d4f9	workqueue: Move the position of debug_work_activate() in __queue_work() [ Upstream commit 0687c66b5f666b5ad433f4e94251590d9bc9d10e ] The debug_work_activate() is called on the premise that the work can be inserted, because if wq be in WQ_DRAINING status, insert work may be failed. Fixes: `e41e704bc4` ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-14 08:24:17 +02:00
Nick Desaulniers	7a92396bf8	gcov: re-fix clang-11+ support commit 9562fd132985ea9185388a112e50f2a51557827d upstream. LLVM changed the expected function signature for llvm_gcda_emit_function() in the clang-11 release. Users of clang-11 or newer may have noticed their kernels producing invalid coverage information: $ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno 1 <func>: checksum mismatch, \ (<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>) 2 Invalid .gcda File! ... Fix up the function signatures so calling this function interprets its parameters correctly and computes the correct cfg checksum. In particular, in clang-11, the additional checksum is no longer optional. Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com Reported-by: Prasad Sodagudi <psodagud@quicinc.com> Tested-by: Prasad Sodagudi <psodagud@quicinc.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-04-14 08:24:10 +02:00

1 2 3 4 5 ...

32114 Commits