Commit Graph

32114 Commits

Author SHA1 Message Date
Andrey Zhizhikin 913880358f This is the 5.4.149 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmFQYpIACgkQONu9yGCS
 aT5nNBAAnao6g0C+ZSPUJrX3Aa9I+dcP8le9T4faUD1E8fa3XSIrDhUrZwhvdI06
 ljos1XzQe60/CJmY1jnfL4TOQqrfS6tLTseVGIhFAtJQorxxQwzuCEq/2sqlYz/A
 BEN1/g1XjXyFMmKw598luTClbHk91pnScxA0ZyJ28lhNeBnpuHKK5+PvqNT2bg6G
 Vc8IGPv7cd48FjfwBzDuWklsQE9FFHPtq2eyhAk6K9QbECnP9wgfdrPx87oyRGN6
 tPtSEhlwNM8EEaFZ1/1zgTgj3n35I3LXGfV19YRid20y1SbwB8yFloidx8SjaAOE
 rMpiyxcDgfYoeHw5WBt+f/QVLx3Ia8uEFgwFSHyD1btNrPGdAlatWgXSrNBLvQuy
 jIoDtqY9L5Ty3T3rBjyDlXl0oUUDD4JyVteGsrXlzVEHa7YaLIhvrcQ6Es09XDZX
 TXPinEMPTohO7/cCVHjXOuREXeYukXLrKuZBBNTutANP9Yx7Tj9yAwVtrnakkv1B
 WykWhjJSmOHcj3q8hm9i1GI8qo3sWIwvM0c8in1OLzA+vpjXPR9onA8PHYidj8LY
 f4E3I2Xp+zBj8WljLgHIJhpwdo8jq5StdYPl0y3Na/ZVU3El3VKtwiT0RieOnKfp
 aOj16+CdbCDpdDZofqu/Zio4Do0RFPsHHeEnfedH38aOw1bjctw=
 =e/m/
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmFRuk4ACgkQ7G51OISz
 Hs05Iw/9EJFb9xJJhi1GzCVg6X5rF4Nna8yNwZOtEWtnG3fE0FcJJ1AP8pN0l93v
 yI+c6KbNl30hqI25ji/VEgoLlOOe5/5QctnPXryXrkf4jvoXQC2TZSGhFfznBlYK
 XcfbaYSlCf9fgaFQBDfi9S1/k/8yVDs9QKIKf6aUlXKmNWhF9I3Fp6nQ1cnaZjMp
 tMto2WOilgCFMwb9EuSTuSPHE0bDqmwk7npVQ8/HA2QdMzLySwHVUZKD8vC9Kvsc
 1csT8lkWak+7N5p2RNyrJAORL9OBvhdPZVyh0cMet00T+zDV0aUDpn3/6R4gnFUz
 TmRdG/oRMqlkhQYvKpFiE295AtU41TMjc9Mb0dg9KYCsCh9yTGgZdR0ftYOHFb+o
 wgXBkURAbj7JETTtRBVMRDEV8Zo1iJjpvGmILT8jO7IsQppkTh2JcjI5rnoxUdGB
 eAd5W36aZfnOEPdUpoWf2mNF8bgQodDVYHc8cPlknJPKP3GM6EZDUNEkKRFTe8hD
 /IP4yLrOvqaxP9n82bPNuRTpKM365Q+rmHkPlTof8ZLYCB52gdJBg9/oTVGfuLfE
 o+DtHU1sBuzukyPEJE2okb3D9uqKM0/xnUxhXQqxDFhXMcI5npllQ7FK71FOWtcp
 4YTGwf/Y4DmK0tckzNg1URnzATRR+6b8zjMR1zc9gfUawnBhwWU=
 =zMkr
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.149' into 5.4-2.3.x-imx

This is the 5.4.149 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-27 12:34:19 +00:00
Pavel Skripkin b94def8a47 profiling: fix shift-out-of-bounds bugs
commit 2d186afd04d669fe9c48b994c41a7405a3c9f16d upstream.

Syzbot reported a shift-out-of-bounds bug in profile_init().
The problem was an incorrect prof_shift. Since the prof_shift value comes
from userspace, we need to clamp it into the [0, BITS_PER_LONG - 1]
range.

A second possible shift-out-of-bounds was found by Tetsuo:
the sample_step local variable in read_profile() had "unsigned int" type,
but prof_shift allows a shift of up to BITS_PER_LONG. So, to prevent a
possible shift-out-of-bounds, sample_step's type was changed to
"unsigned long".

Also, "unsigned short int" is sufficient for storing a value in
[0, BITS_PER_LONG], which is why there is no need for an
"unsigned long" prof_shift.

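A minimal sketch of the clamping idea (call shape simplified; this is an
illustration, not the exact upstream hunk):

	/* prof_shift fits in an unsigned short and is clamped on input */
	static unsigned short int prof_shift;

	static void set_prof_shift(unsigned long user_shift)
	{
		prof_shift = clamp(user_shift, 0UL, (unsigned long)(BITS_PER_LONG - 1));
	}

	/* read_profile(): unsigned long so "1UL << prof_shift" cannot overflow */
	unsigned long sample_step = 1UL << prof_shift;
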
Link: https://lkml.kernel.org/r/20210813140022.5011-1-paskripkin@gmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+e68c89a9510c159d9684@syzkaller.appspotmail.com
Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-26 14:07:09 +02:00
Cyrill Gorcunov 5607b1bae1 prctl: allow to setup brk for et_dyn executables
commit e1fbbd073137a9d63279f6bf363151a938347640 upstream.

Keno Fischer reported that when a binary is loaded via ld-linux-x, the
prctl(PR_SET_MM_MAP) doesn't allow setting up the brk value because it
lies before mm->end_data.

For example a test program shows

 | # ~/t
 |
 | start_code      401000
 | end_code        401a15
 | start_stack     7ffce4577dd0
 | start_data	   403e10
 | end_data        40408c
 | start_brk	   b5b000
 | sbrk(0)         b5b000

and when executed via ld-linux

 | # /lib64/ld-linux-x86-64.so.2 ~/t
 |
 | start_code      7fc25b0a4000
 | end_code        7fc25b0c4524
 | start_stack     7fffcc6b2400
 | start_data	   7fc25b0ce4c0
 | end_data        7fc25b0cff98
 | start_brk	   55555710c000
 | sbrk(0)         55555710c000

This of course prevents criu from restoring such programs.  Looking into
how the kernel operates with brk/start_brk inside the brk() syscall, I don't
see any problem if we allow setting up brk/start_brk without checking for
end_data.  Even if someone passes some weird address here on purpose, the
worst possible result will be an unexpected unmapping of an existing vma
(their own vma, since prctl works with the caller's memory), but the test for
RLIMIT_DATA is still valid and a user won't be able to gain more memory in
case of expanding VMAs via new values shipped with the prctl call.
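
As an illustration, a userspace restore tool would now be able to set these
values roughly like this (a sketch with most fields elided; not criu's
actual code):

	#include <sys/prctl.h>
	#include <linux/prctl.h>

	struct prctl_mm_map map = {
		/* ... start_code, end_code, start_data, end_data, ... */
	};

	map.start_brk = saved_start_brk;   /* may now lie below end_data */
	map.brk       = saved_brk;

	if (prctl(PR_SET_MM, PR_SET_MM_MAP, (unsigned long)&map, sizeof(map), 0))
		perror("PR_SET_MM_MAP");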

Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Andrey Vagin <avagin@gmail.com>
Tested-by: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-26 14:07:08 +02:00
Andrey Zhizhikin 8b231e0e8e This is the 5.4.148 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmFLBPMACgkQONu9yGCS
 aT6BIQ//Wb4ZQJtEVvaKnda7vFwe8BoZzPGYZA4Imn9KERDRgHuavEuRfMQtKc2y
 YHwe/PD2JreuDHcd+Wz32xsdMe045xNvgiE1oGcxq0jNBvhJqANSmVTWpdqAquON
 cTmwsK3roa7ELC2g1WjrYZDv6CrCggqvbuM9AJ/cLITtd8zerhLdZo+CCDG/28cH
 EosrWvkBcaGmX+r/IBC86Rt6K2OFQ/3LLbb79L4vjKi5lopsm5CTAmfOfIk8p1gB
 mGB3PkQZnIqphBfqGXLGuljl4e+zb1SONrugUh78Egom393Ex34oo+RjWEGe9dV2
 Stkuqo0GTi85X7JA7SGCA/xgF8A8yvaaLjQBsJsL9+2ji+GW+J7hfn4mE5h8H3Di
 UBjeLMFJA8Mge8Ng9xUSttvjRdwSTm0jWTS9SOl07w24b0pKYbMrQdWt2eI6CT+/
 ytq3nCxNJZKeVcAVH+OJNrbSLYvMy/PgYvGTbzASkNmpAeyNiHOyBz1sRcoiAM9U
 QCWDdZyaqDKktqEyKHxK3opqPzbnHfZFFlCxR7Gw7vvR+itIGJEh/50RNv2F6vnu
 wzowrVxe+Bf1h7JiNEqLLVHdiuygRqjH1ygepGM4+3TVF4jYHzDISyrqlA/Se3Pg
 Hhvlzsbv7PH+KiApwBFjSeHTs5WOrokGMFQ7ZYFDpPkleWiywS0=
 =50Hk
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.148' into 5.4-2.3.x-imx

This is the 5.4.148 stable release

Conflicts:
- drivers/dma/imx-sdma.c:
Following upstream patches are already applied to NXP tree:
7cfbf391e8 ("dmaengine: imx-sdma: remove duplicated sdma_load_context")
788122c99d ("Revert "dmaengine: imx-sdma: refine to load context only
once"")

- drivers/usb/chipidea/host.c:
Merge upstream commit a18cfd715e ("usb: chipidea: host: fix port index
underflow and UBSAN complains") to NXP version.

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-22 13:19:32 +00:00
Masami Hiramatsu b3435cd968 tracing/probes: Reject events which have the same name of existing one
[ Upstream commit 8e242060c6a4947e8ae7d29794af6a581db08841 ]

Since kprobe_events and uprobe_events only check whether another
probe event of the same type has the same name, if the user gives the
same name as an existing tracepoint event (or a probe event of the
other type), creating the tracefs entry silently fails (but the event
is still registered), as below.

/sys/kernel/tracing # ls events/task/task_rename
enable   filter   format   hist     id       trigger
/sys/kernel/tracing # echo p:task/task_rename vfs_read >> kprobe_events
[  113.048508] Could not create tracefs 'task_rename' directory
/sys/kernel/tracing # cat kprobe_events
p:task/task_rename vfs_read

To fix this issue, check whether any existing event has the same name
in trace_probe_register_event_call(). If one exists, registration of
the new event is rejected.
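
A rough sketch of the check (the helper used here is an assumption for
illustration, not necessarily the exact upstream hunk):

	/* trace_probe_register_event_call(), before registering the call: */
	if (find_event_file(NULL, trace_probe_group_name(tp), trace_probe_name(tp)))
		return -EEXIST;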

Link: https://lkml.kernel.org/r/162936876189.187130.17558311387542061930.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22 12:26:43 +02:00
Baptiste Lepers 172749c879 events: Reuse value read using READ_ONCE instead of re-reading it
commit b89a05b21f46150ac10a962aa50109250b56b03b upstream.

In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
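
A condensed sketch of the pattern, simplified from
perf_event_addr_filters_apply():

	struct task_struct *task = READ_ONCE(event->ctx->task);

	if (task == TASK_TOMBSTONE)
		return;

	/* before: mm = get_task_mm(event->ctx->task);  (re-read, unchecked) */
	mm = get_task_mm(task);		/* after: reuse the value checked above */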

Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210906015310.12802-1-baptiste.lepers@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:41 +02:00
Vasily Averin 0569920e43 memcg: enable accounting for pids in nested pid namespaces
commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba upstream.

Commit 5d097056c9 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces.  As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.

Pid was one of the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep is marked with SLAB_ACCOUNT, so we can expect that
any new pids in the system are memcg-accounted.

Though recently I've noticed that this is wrong.  Nested pid namespaces
create their own slab caches for pid objects, and nested pids have an
increased size because they contain ids for all parent pid namespaces as
well as their own.  The problem is that these slab caches are _NOT_ marked
with SLAB_ACCOUNT; as a result, any pids allocated in nested pid namespaces
are not memcg-accounted.

A pid struct in a nested pid namespace consumes up to 500 bytes of memory;
100000 such objects give us up to ~50MB of unaccounted memory, which allows
a container to exceed its assigned memcg limits.
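
The fix is essentially to create the per-namespace caches with the same
accounting flag as init_pid_ns; roughly (a sketch of create_pid_cachep(),
not the verbatim hunk):

	ns->pid_cachep = kmem_cache_create(name, len, 0,
					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);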

Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:37 +02:00
Liu Zixian 22b11dbbf9 mm/hugetlb: initialize hugetlb_usage in mm_init
commit 13db8c50477d83ad3e3b9b0ae247e5cd833a7ae4 upstream.

After fork, the child process will get incorrect (2x) hugetlb_usage.  If
a process uses 5 2MB hugetlb pages in an anonymous mapping,

	HugetlbPages:	   10240 kB

and then forks, the child will show,

	HugetlbPages:	   20480 kB

The reason for the doubled amount is that hugetlb_usage is copied from
the parent and then increased again when we copy page tables from parent
to child.  The child ends up with 2x the actual usage.

Fix this by adding hugetlb_count_init in mm_init.
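
The helper and its call site are small; a sketch matching the changelog
(assuming the counter is the atomic_long_t hugetlb_usage in mm_struct):

	static inline void hugetlb_count_init(struct mm_struct *mm)
	{
		atomic_long_set(&mm->hugetlb_usage, 0);
	}

	/* called from mm_init(), so the child starts from zero */
	hugetlb_count_init(mm);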

Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b65 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:37 +02:00
Zhen Lei b6cee35839 workqueue: Fix possible memory leaks in wq_numa_init()
[ Upstream commit f728c4a9e8405caae69d4bc1232c54ff57b5d20f ]

In the error handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the
previously allocated memory is not released. Doing this check before
allocating any memory eliminates the leak.

tj: Note that the condition only occurs when the arch code is pretty broken
and the WARN_ON might as well be BUG_ON().
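
A sketch of the reordering in wq_numa_init() (allocation details elided):

	/* sanity-check the CPU-to-node mapping before allocating anything */
	for_each_possible_cpu(cpu) {
		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			return;
		}
	}

	/* only now allocate wq_numa_possible_cpumask and the per-node masks */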

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22 12:26:30 +02:00
Anthony Iliopoulos be09cbd6a3 dma-debug: fix debugfs initialization order
[ Upstream commit 173735c346c412d9f084825ecb04f24ada0e2986 ]

Due to link order, dma_debug_init is called before debugfs has a chance
to initialize (via debugfs_init which also happens in the core initcall
stage), so the directories for dma-debug are never created.

Decouple dma_debug_fs_init from dma_debug_init and defer its init until
core_initcall_sync (after debugfs has been initialized) while letting
dma-debug initialization occur as soon as possible to catch any early
mappings, as suggested in [1].

[1] https://lore.kernel.org/linux-iommu/YIgGa6yF%2Fadg8OSN@kroah.com/
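
A minimal sketch of the decoupled initializer (file creation elided):

	static int __init dma_debug_fs_init(void)
	{
		struct dentry *dir = debugfs_create_dir("dma-api", NULL);

		/* ... create the individual dma-api debugfs files under dir ... */
		return 0;
	}
	core_initcall_sync(dma_debug_fs_init);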

Fixes: 15b28bbcd5 ("dma-debug: move initialization to common code")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22 12:26:24 +02:00
Andrey Zhizhikin 33478cc104 This is the 5.4.147 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmFDItIACgkQONu9yGCS
 aT7qHw/+MC8nqbYT/fMu3ZsTsNB8JZelGhrpXVAFc4E80oZDVEF3aNh8koTk13o/
 sI+9LZD6vPJ7kxPBT45PPxDfwM7VlVlQTVWQ8lUjY+8a2Ml7xZSiz9I3bMX0MsvU
 ylXeAJ4BntXI4679ccSjWGwwcQa+PrHsPrpZqKLBI4+pjxps3eMGwK0yE5jcRd4Z
 WlhgYbE8QhmB4iWSA5CBr7IY5pVIlKpvovlvIi1TlU1C9LXU6O8OBPkhmbuFF4fG
 HY2ge6d+xZItcn9RFi+uH51PwPBpN9U2I+QakQ4iyMNgF1uqHzfWh30ZeINxHzw2
 2Nn/nCmNeUwoOt6YSQGxlZUieqqvq5y4VeXo2nThqBdds8GPJ4AhvvxmM/NnP0EV
 lfI6RSxJcYULfl0XQB5sj3fJ2Fo7G7Mx+kg0q0018iTuOx9kNdQF/WXdLqHvBUIi
 VIaRFR2wKeWXCV7tmb5o8928clv0FWJRi19Gcq8597zWxGC8mlSOg48jVVMkE9rU
 IaYhIiIRL/Dkf34Hal2+mZbvbD5IjtViw9uYOzME7NuF4VVHD6RavsMOM8AuUc+t
 OtSRo6mID1rH+RrvFZhZRzcT3Fg5NVbmEhiVG7tH26ZEJE2Gs/CjexQgTHf25+VL
 /H8Mnjjw7gAmvqAJyG9s3eocnhuuWeYj5YWg3vAxVNoUqUT2uO4=
 =RNKE
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmFDnvkACgkQ7G51OISz
 Hs17yQ/8CoCEE3h3RZQ07Y4RSyePZU96IRGqd5Dhyli0BUgWahmaaoBrXYQPL2c1
 IzBW2/e0vxCwMflVHxRMS6tXoTyj3dXuNG30SU+pkCDnsWyrWxtNALFgDt0a/wHa
 pszFKfnXKxrYOKKU2gSPxFu1+0Tm0zV6puwgvwcuKzUFx/wjIZxH+8+fLhTWYu2X
 7OBuwCvDjbsQgnruKbyysI14DBu4rfYm/7hXLgRLCtYRdUYeuVrR1W5N8QEqLYr4
 daZHCl2T4VF6qiiE5FdZc6yfW0H6KSK8A1B9PYZnSJyeYwLS1edXQFbZq4VBSjPh
 wIefx3xB0ZKuyPUHc1EYjxvxfYJCbWjFaOT8aT+qDxkpYgMRltur1fmE1fS0P9hp
 3g2JrSl5EzzNqklZUuC51X4Gq2rGrO31FCXcOh/rAfbJ4sHeUlWU2pnTN7ilJt/W
 yW51Rl+mYPNTY964iCkMzVmNyAH9oIgJyR7a9P0gkIUwCqq780POtN2bMzkS/Om0
 lauKR5NmuNVw//jfeQddrrdVILfErZ5tHplGNsk3VFMi2/ITNEbmm1CJGb2090F/
 EK1lcWOj6vuETN9Lqc/dv2hX0UgDG8I1SVb1SXwRLCCZREiuoS+Z3EmMfLAkmV46
 tXz+x34BTvMl3CEZMvy69ovTrJlh6537mFx2T7+FNBqeDLHT2Z4=
 =yJQM
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.147' into 5.4-2.3.x-imx

This is the 5.4.147 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-16 19:45:58 +00:00
Greg Kroah-Hartman dc15f641c6 Revert "posix-cpu-timers: Force next expiration recalc after itimer reset"
This reverts commit c322a963d5 which is
commit 406dd42bd1ba0c01babf9cde169bb319e52f6147 upstream.

It is reported to cause regressions.  A proposed fix has been posted,
but it is not in a released kernel yet.  So just revert this from the
stable release so that the bug is fixed.  If it's really needed we can
add it back in a future release.

Link: https://lore.kernel.org/r/87ilz1pwaq.fsf@wylie.me.uk
Reported-by: "Alan J. Wylie" <alan@wylie.me.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-16 12:56:13 +02:00
Daniel Borkmann ae968e270f bpf: Fix pointer arithmetic mask tightening under state pruning
commit e042aa532c84d18ff13291d00620502ce7a38dda upstream.

In 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via side-channel to user space.

The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a8307.

One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[OP: adjusted context in include/linux/bpf_verifier.h for 5.4]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15 09:47:39 +02:00
Lorenz Bauer a0a4778fea bpf: verifier: Allocate idmap scratch in verifier env
commit c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9 upstream.

func_states_equal makes a very short lived allocation for idmap,
probably because it's too large to fit on the stack. However the
function is called quite often, leading to a lot of alloc / free
churn. Replace the temporary allocation with dedicated scratch
space in struct bpf_verifier_env.
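
A sketch of the change (member names and sizes simplified for illustration):

	/* include/linux/bpf_verifier.h */
	struct bpf_verifier_env {
		/* ... */
		struct bpf_id_pair idmap_scratch[BPF_ID_MAP_SIZE];
		/* ... */
	};

	/* func_states_equal(): no more kcalloc()/kfree() per call */
	memset(env->idmap_scratch, 0, sizeof(env->idmap_scratch));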

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[OP: adjusted context for 5.4]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15 09:47:38 +02:00
Daniel Borkmann f5893af270 bpf: Fix leakage due to insufficient speculative store bypass mitigation
commit 2039f26f3aca5b0e419b98f65dd36481337b86ee upstream.

Spectre v4 gadgets make use of memory disambiguation, which is a set of
techniques that execute memory access instructions, that is, loads and
stores, out of program order; Intel's optimization manual, section 2.4.4.5:

  A load instruction micro-op may depend on a preceding store. Many
  microarchitectures block loads until all preceding store addresses are
  known. The memory disambiguator predicts which loads will not depend on
  any previous stores. When the disambiguator predicts that a load does
  not have such a dependency, the load takes its data from the L1 data
  cache. Eventually, the prediction is verified. If an actual conflict is
  detected, the load and all succeeding instructions are re-executed.

af86ca4e30 ("bpf: Prevent memory disambiguation attack") tried to mitigate
this attack by sanitizing the memory locations through preemptive "fast"
(low latency) stores of zero prior to the actual "slow" (high latency) store
of a pointer value such that upon dependency misprediction the CPU then
speculatively executes the load of the pointer value and retrieves the zero
value instead of the attacker controlled scalar value previously stored at
that location, meaning, subsequent access in the speculative domain is then
redirected to the "zero page".

The sanitized preemptive store of zero prior to the actual "slow" store is
done through a simple ST instruction based on r10 (frame pointer) with
relative offset to the stack location that the verifier has been tracking
on the original used register for STX, which does not have to be r10. Thus,
there are no memory dependencies for this store, since it's only using r10
and immediate constant of zero; hence af86ca4e30 /assumed/ a low latency
operation.

However, a recent attack demonstrated that this mitigation is not sufficient
since the preemptive store of zero could also be turned into a "slow" store
and is thus bypassed as well:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  31: (7b) *(u64 *)(r10 -16) = r2
  // r9 will remain "fast" register, r10 will become "slow" register below
  32: (bf) r9 = r10
  // JIT maps BPF reg to x86 reg:
  //  r9  -> r15 (callee saved)
  //  r10 -> rbp
  // train store forward prediction to break dependency link between both r9
  // and r10 by evicting them from the predictor's LRU table.
  33: (61) r0 = *(u32 *)(r7 +24576)
  34: (63) *(u32 *)(r7 +29696) = r0
  35: (61) r0 = *(u32 *)(r7 +24580)
  36: (63) *(u32 *)(r7 +29700) = r0
  37: (61) r0 = *(u32 *)(r7 +24584)
  38: (63) *(u32 *)(r7 +29704) = r0
  39: (61) r0 = *(u32 *)(r7 +24588)
  40: (63) *(u32 *)(r7 +29708) = r0
  [...]
  543: (61) r0 = *(u32 *)(r7 +25596)
  544: (63) *(u32 *)(r7 +30716) = r0
  // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
  // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
  // in hardware registers. rbp becomes slow due to push/pop latency. below is
  // disasm of bpf_ringbuf_output() helper for better visual context:
  //
  // ffffffff8117ee20: 41 54                 push   r12
  // ffffffff8117ee22: 55                    push   rbp
  // ffffffff8117ee23: 53                    push   rbx
  // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
  // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
  // [...]
  // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
  // ffffffff8117eee7: 5b                    pop    rbx
  // ffffffff8117eee8: 5d                    pop    rbp
  // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
  // ffffffff8117eeec: 41 5c                 pop    r12
  // ffffffff8117eeee: c3                    ret
  545: (18) r1 = map[id:4]
  547: (bf) r2 = r7
  548: (b7) r3 = 0
  549: (b7) r4 = 4
  550: (85) call bpf_ringbuf_output#194288
  // instruction 551 inserted by verifier    \
  551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
  // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
  552: (7b) *(u64 *)(r10 -16) = r7           /
  // following "fast" read to the same memory location, but due to dependency
  // misprediction it will speculatively execute before insn 551/552 completes.
  553: (79) r2 = *(u64 *)(r9 -16)
  // in speculative domain contains attacker controlled r2. in non-speculative
  // domain this contains r7, and thus accesses r7 +0 below.
  554: (71) r3 = *(u8 *)(r2 +0)
  // leak r3

As can be seen, the current speculative store bypass mitigation which the
verifier inserts at line 551 is insufficient since /both/, the write of
the zero sanitation as well as the map value pointer are a high latency
instruction due to prior memory access via push/pop of r10 (rbp) in contrast
to the low latency read in line 553 as r9 (r15) which stays in hardware
registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
fp-16 can still be r2.

Initial thoughts to address this issue was to track spilled pointer loads
from stack and enforce their load via LDX through r10 as well so that /both/
the preemptive store of zero /as well as/ the load use the /same/ register
such that a dependency is created between the store and load. However, this
option is not sufficient either since it can be bypassed as well under
speculation. An updated attack with pointer spill/fills now _all_ based on
r10 would look as follows:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  [...]
  // longer store forward prediction training sequence than before.
  2062: (61) r0 = *(u32 *)(r7 +25588)
  2063: (63) *(u32 *)(r7 +30708) = r0
  2064: (61) r0 = *(u32 *)(r7 +25592)
  2065: (63) *(u32 *)(r7 +30712) = r0
  2066: (61) r0 = *(u32 *)(r7 +25596)
  2067: (63) *(u32 *)(r7 +30716) = r0
  // store the speculative load address (scalar) this time after the store
  // forward prediction training.
  2068: (7b) *(u64 *)(r10 -16) = r2
  // preoccupy the CPU store port by running sequence of dummy stores.
  2069: (63) *(u32 *)(r7 +29696) = r0
  2070: (63) *(u32 *)(r7 +29700) = r0
  2071: (63) *(u32 *)(r7 +29704) = r0
  2072: (63) *(u32 *)(r7 +29708) = r0
  2073: (63) *(u32 *)(r7 +29712) = r0
  2074: (63) *(u32 *)(r7 +29716) = r0
  2075: (63) *(u32 *)(r7 +29720) = r0
  2076: (63) *(u32 *)(r7 +29724) = r0
  2077: (63) *(u32 *)(r7 +29728) = r0
  2078: (63) *(u32 *)(r7 +29732) = r0
  2079: (63) *(u32 *)(r7 +29736) = r0
  2080: (63) *(u32 *)(r7 +29740) = r0
  2081: (63) *(u32 *)(r7 +29744) = r0
  2082: (63) *(u32 *)(r7 +29748) = r0
  2083: (63) *(u32 *)(r7 +29752) = r0
  2084: (63) *(u32 *)(r7 +29756) = r0
  2085: (63) *(u32 *)(r7 +29760) = r0
  2086: (63) *(u32 *)(r7 +29764) = r0
  2087: (63) *(u32 *)(r7 +29768) = r0
  2088: (63) *(u32 *)(r7 +29772) = r0
  2089: (63) *(u32 *)(r7 +29776) = r0
  2090: (63) *(u32 *)(r7 +29780) = r0
  2091: (63) *(u32 *)(r7 +29784) = r0
  2092: (63) *(u32 *)(r7 +29788) = r0
  2093: (63) *(u32 *)(r7 +29792) = r0
  2094: (63) *(u32 *)(r7 +29796) = r0
  2095: (63) *(u32 *)(r7 +29800) = r0
  2096: (63) *(u32 *)(r7 +29804) = r0
  2097: (63) *(u32 *)(r7 +29808) = r0
  2098: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; same as before, also including the
  // sanitation store with 0 from the current mitigation by the verifier.
  2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
  2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
  // load from stack intended to bypass stores.
  2101: (79) r2 = *(u64 *)(r10 -16)
  2102: (71) r3 = *(u8 *)(r2 +0)
  // leak r3
  [...]

Looking at the CPU microarchitecture, the scheduler might issue loads (such
as seen in line 2101) before stores (line 2099,2100) because the load execution
units become available while the store execution unit is still busy with the
sequence of dummy stores (line 2069-2098). And so the load may use the prior
stored scalar from r2 at address r10 -16 for speculation. The updated attack
may work less reliably on CPU microarchitectures where loads and stores share
execution resources.

This concludes that the sanitizing with zero stores from af86ca4e30 ("bpf:
Prevent memory disambiguation attack") is insufficient. Moreover, the detection
of stack reuse from af86ca4e30 where previously data (STACK_MISC) has been
written to a given stack slot where a pointer value is now to be stored does
not have sufficient coverage as precondition for the mitigation either; for
several reasons outlined as follows:

 1) Stack content from prior program runs could still be preserved and is
    therefore not "random", best example is to split a speculative store
    bypass attack between tail calls, program A would prepare and store the
    oob address at a given stack slot and then tail call into program B which
    does the "slow" store of a pointer to the stack with subsequent "fast"
    read. From program B PoV such stack slot type is STACK_INVALID, and
    therefore also must be subject to mitigation.

 2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
    condition, for example, the previous content of that memory location could
    also be a pointer to map or map value. Without the fix, a speculative
    store bypass is not mitigated in such precondition and can then lead to
    a type confusion in the speculative domain leaking kernel memory near
    these pointer types.

While brainstorming on various alternative mitigation possibilities, we also
stumbled upon a retrospective from Chrome developers [0]:

  [...] For variant 4, we implemented a mitigation to zero the unused memory
  of the heap prior to allocation, which cost about 1% when done concurrently
  and 4% for scavenging. Variant 4 defeats everything we could think of. We
  explored more mitigations for variant 4 but the threat proved to be more
  pervasive and dangerous than we anticipated. For example, stack slots used
  by the register allocator in the optimizing compiler could be subject to
  type confusion, leading to pointer crafting. Mitigating type confusion for
  stack slots alone would have required a complete redesign of the backend of
  the optimizing compiler, perhaps man years of work, without a guarantee of
  completeness. [...]

From the BPF side, the problem space is reduced, however, options are rather
limited. One idea that has been explored was to xor-obfuscate pointer spills
to the BPF stack:

  [...]
  // preoccupy the CPU store port by running sequence of dummy stores.
  [...]
  2106: (63) *(u32 *)(r7 +29796) = r0
  2107: (63) *(u32 *)(r7 +29800) = r0
  2108: (63) *(u32 *)(r7 +29804) = r0
  2109: (63) *(u32 *)(r7 +29808) = r0
  2110: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; xored with random 'secret' value
  // of 943576462 before store ...
  2111: (b4) w11 = 943576462
  2112: (af) r11 ^= r7
  2113: (7b) *(u64 *)(r10 -16) = r11
  2114: (79) r11 = *(u64 *)(r10 -16)
  2115: (b4) w2 = 943576462
  2116: (af) r2 ^= r11
  // ... and restored with the same 'secret' value with the help of AX reg.
  2117: (71) r3 = *(u8 *)(r2 +0)
  [...]

While the above would not prevent speculation, it would make data leakage
infeasible by directing it to random locations. In order to be effective
and prevent type confusion under speculation, such random secret would have
to be regenerated for each store. The additional complexity involved for a
tracking mechanism that prevents jumps such that restoring spilled pointers
would not get corrupted is not worth the gain for unprivileged. Hence, the
fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
instruction which the x86 JIT translates into a lfence opcode. Inserting the
latter in between the store and load instruction is one of the mitigation
options [1]. The x86 instruction manual notes:

  [...] An LFENCE that follows an instruction that stores to memory might
  complete before the data being stored have become globally visible. [...]

The latter meaning that the preceding store instruction finished execution
and the store is at minimum guaranteed to be in the CPU's store queue, but
it's not guaranteed to be in that CPU's L1 cache at that point (globally
visible). The latter would only be guaranteed via sfence. So the load which
is guaranteed to execute after the lfence for that local CPU would have to
rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:

  [...] For every store operation that is added to the ROB, an entry is
  allocated in the store buffer. This entry requires both the virtual and
  physical address of the target. Only if there is no free entry in the store
  buffer, the frontend stalls until there is an empty slot available in the
  store buffer again. Otherwise, the CPU can immediately continue adding
  subsequent instructions to the ROB and execute them out of order. On Intel
  CPUs, the store buffer has up to 56 entries. [...]

One small upside on the fix is that it lifts constraints from af86ca4e30
where the sanitize_stack_off relative to r10 must be the same when coming
from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
or BPF_ST instruction. This happens either when we store a pointer or data
value to the BPF stack for the first time, or upon later pointer spills.
The former needs to be enforced since otherwise stale stack data could be
leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
BPF_NOSPEC mapping is currently optimized away, but others could emit a
speculation barrier as well if necessary. For real-world unprivileged
programs e.g. generated by LLVM, pointer spill/fill is only generated upon
register pressure and LLVM only tries to do that for pointers which are not
used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
sanitation for the STACK_INVALID case when the first write to a stack slot
occurs e.g. upon map lookup. In future we might refine ways to mitigate
the latter cost.

  [0] https://arxiv.org/pdf/1902.05178.pdf
  [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
  [2] https://arxiv.org/pdf/1905.05725.pdf
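
For reference, the new instruction is encoded like any other BPF_ST insn but
with zeroed operands; a sketch of a builder macro in the style of the
existing BPF_ST_MEM() helpers (an illustration, not necessarily the exact
upstream definition):

	#define BPF_ST_NOSPEC()					\
		((struct bpf_insn) {				\
			.code  = BPF_ST | BPF_NOSPEC,		\
			.dst_reg = 0,				\
			.src_reg = 0,				\
			.off   = 0,				\
			.imm   = 0 })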

Fixes: af86ca4e30 ("bpf: Prevent memory disambiguation attack")
Fixes: f7cf25b202 ("bpf: track spill/fill of constants")
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[OP: - apply check_stack_write_fixed_off() changes in check_stack_write()
     - replace env->bypass_spec_v4 -> env->allow_ptr_leaks]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15 09:47:38 +02:00
Daniel Borkmann e80c3533c3 bpf: Introduce BPF nospec instruction for mitigating Spectre v4
commit f5e81d1117501546b7be050c5fbafa6efd2c722c upstream.

In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.

This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.

The BPF nospec is required for a future commit, where the BPF verifier does
annotate intermediate BPF programs with speculation barriers.
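
A sketch of the x86-64 JIT mapping described above (trimmed; unaffected
architectures may emit nothing for this case):

	case BPF_ST | BPF_NOSPEC:
		if (boot_cpu_has(X86_FEATURE_XMM2))
			/* Emit 'lfence' */
			EMIT3(0x0F, 0xAE, 0xE8);
		break;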

Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[OP: - adjusted context for 5.4
     - apply riscv changes to /arch/riscv/net/bpf_jit_comp.c]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15 09:47:38 +02:00
Andrey Ignatov e37eeaf950 bpf: Fix possible out of bound write in narrow load handling
[ Upstream commit d7af7e497f0308bc97809cc48b58e8e0f13887e1 ]

Fix a verifier bug found by smatch static checker in [0].

This problem has never been seen in prod to the best of my knowledge. Fixing it
still seems to be a good idea since it's hard to say for sure whether
it's possible or not to have a scenario where a combination of
convert_ctx_access() and a narrow load would lead to an out of bound
write.

When a narrow load is handled, one or two new instructions are added to
the insn_buf array, but previously it was only checked that

	cnt >= ARRAY_SIZE(insn_buf)

And it's safe to add a new instruction to insn_buf[cnt++] only once. A
second attempt will lead to an out-of-bounds write. And this is what can
happen if `shift` is set.

Fix it by making sure that if the BPF_RSH instruction has to be added in
addition to BPF_AND then there is enough space for two more instructions
in insn_buf.
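
The added check looks roughly like this (a sketch; the verbose() message
wording is assumed):

	if (is_narrower_load && size < target_size) {
		u8 shift = bpf_ctx_narrow_access_offset(off, size, size_default) * 8;

		/* two insns may be emitted below, so demand room for both */
		if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
			verbose(env, "bpf verifier narrow ctx load misconfigured\n");
			return -EINVAL;
		}
		[...]
	}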

The full report [0] is below:

kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array

kernel/bpf/verifier.c
    12282
    12283 			insn->off = off & ~(size_default - 1);
    12284 			insn->code = BPF_LDX | BPF_MEM | size_code;
    12285 		}
    12286
    12287 		target_size = 0;
    12288 		cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
    12289 					 &target_size);
    12290 		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bounds check.

    12291 		    (ctx_field_size && !target_size)) {
    12292 			verbose(env, "bpf verifier is misconfigured\n");
    12293 			return -EINVAL;
    12294 		}
    12295
    12296 		if (is_narrower_load && size < target_size) {
    12297 			u8 shift = bpf_ctx_narrow_access_offset(
    12298 				off, size, size_default) * 8;
    12299 			if (ctx_field_size <= 4) {
    12300 				if (shift)
    12301 					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
                                                         ^^^^^
increment beyond end of array

    12302 									insn->dst_reg,
    12303 									shift);
--> 12304 				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                                 ^^^^^
out of bounds write

    12305 								(1 << size * 8) - 1);
    12306 			} else {
    12307 				if (shift)
    12308 					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
    12309 									insn->dst_reg,
    12310 									shift);
    12311 				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                        ^^^^^^^^^^^^^^^
Same.

    12312 								(1ULL << size * 8) - 1);
    12313 			}
    12314 		}
    12315
    12316 		new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
    12317 		if (!new_prog)
    12318 			return -ENOMEM;
    12319
    12320 		delta += cnt - 1;
    12321
    12322 		/* keep walking new program and skip insns we just inserted */
    12323 		env->prog = new_prog;
    12324 		insn      = new_prog->insnsi + i + delta;
    12325 	}
    12326
    12327 	return 0;
    12328 }

[0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/

v1->v2:
- clarify that problem was only seen by static checker but not in prod;

Fixes: 46f53a65d2 ("bpf: Allow narrow loads with offset > 0")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:36 +02:00
Lukasz Luba 156eaacba3 PM: EM: Increase energy calculation precision
[ Upstream commit 7fcc17d0cb12938d2b3507973a6f93fc9ed2c7a1 ]

The Energy Model (EM) provides useful information about device power in
each performance state to other subsystems, such as the Energy Aware
Scheduler (EAS). The energy calculation in EAS does arithmetic based on
the EM's em_cpu_energy(). The current implementation of that function uses
em_perf_state::cost as a pre-computed cost coefficient equal to:
cost = power * max_frequency / frequency.
The 'power' is expressed in milli-Watts (or in an abstract scale).

There are corner cases where the EAS energy calculation for two Performance
Domains (PDs) returns the same value. EAS compares these values to choose
the smaller one. It might happen that these values are equal due to
rounding error. In such a scenario, we need better resolution, e.g. 1000
times better. To provide this, increase the resolution of
em_perf_state::cost for 64-bit architectures. The cost of increasing
resolution on 32-bit is pretty high (64-bit division) and is not justified,
since no new 32-bit big.LITTLE EAS systems are expected which would
benefit from this higher resolution.

This patch avoids the rounding-to-milli-Watt errors which might
occur in the EAS energy estimation for each PD. The rounding error is
common for small tasks which have small utilization values.

There are two places in the code where it makes a difference:
1. In find_energy_efficient_cpu(), where we are searching for
best_delta: we might suffer there when two PDs return the same result,
like in the example below.

Scenario:
A low-utilization system, e.g. ~200 sum_util for PD0 and ~220 for PD1. There
are quite a few small tasks of ~10-15 util. These tasks would suffer from
the rounding error. These utilization values are typical when running games
on Android. One of our partners has reported 5..10mA less battery drain
when running with increased resolution.

Some details:
We have two PDs: PD0 (big) and PD1 (little)
Let's compare w/o patch set ('old') and w/ patch set ('new')
We are comparing energy w/ task and w/o task placed in the PDs

a) 'old' w/o patch set, PD0
task_util = 13
cost = 480
sum_util_w/o_task = 215
sum_util_w_task = 228
scale_cpu = 1024
energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100
energy_w_task = 480 * 228 / 1024 = 106.87 => 106
energy_diff = 106 - 100 = 6
(this is equal to 'old' PD1's energy_diff in 'c)')

b) 'new' w/ patch set, PD0
task_util = 13
cost = 480 * 1000 = 480000
sum_util_w/o_task = 215
sum_util_w_task = 228
energy_w/o_task = 480000 * 215 / 1024 = 100781
energy_w_task = 480000 * 228 / 1024  = 106875
energy_diff = 106875 - 100781 = 6094
(this is not equal to 'new' PD1's energy_diff in 'd)')

c) 'old' w/o patch set, PD1
task_util = 13
cost = 160
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160 * 283 / 355 = 127.55 => 127
energy_w_task = 160 * 296 / 355 = 133.41 => 133
energy_diff = 133 - 127 = 6
(this is equal to 'old' PD0's energy_diff in 'a)')

d) 'new' w/ patch set, PD1
task_util = 13
cost = 160 * 1000 = 160000
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160000 * 283 / 355 = 127549
energy_w_task = 160000 * 296 / 355 =   133408
energy_diff = 133408 - 127549 = 5859
(this is not equal to 'new' PD0's energy_diff in 'b)')

2. Difference in the 6% energy margin filter at the end of
find_energy_efficient_cpu(). With this patch the margin comparison also
has better resolution, so it's possible to have better task placement
thanks to that.
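
Concretely, the extra resolution boils down to scaling the pre-computed
coefficient on 64-bit; roughly (a sketch close to the changelog's intent):

	#ifdef CONFIG_64BIT
	#define em_scale_power(p)	((p) * 1000)
	#else
	#define em_scale_power(p)	(p)
	#endif

	/* when building the table of performance states: */
	table[i].cost = div64_u64(fmax * em_scale_power(table[i].power),
				  table[i].frequency);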

Fixes: 27871f7a8a ("PM: Introduce an Energy Model management framework")
Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:33 +02:00
Waiman Long 9fdac650c4 cgroup/cpuset: Fix a partition bug with hotplug
[ Upstream commit 15d428e6fe77fffc3f4fff923336036f5496ef17 ]

In cpuset_hotplug_workfn(), the detection of whether the cpu list
has been changed is done by comparing the effective cpus of the top
cpuset with the cpu_active_mask. However, in the rare case that all
the CPUs in subparts_cpus are offlined, the detection fails
and the partition states are not updated correctly. Fix it by forcing
the cpus_updated flag to true in this particular case.
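
Roughly, the added special case in cpuset_hotplug_workfn() looks like this
(a sketch):

	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);

	/*
	 * If all CPUs in subparts_cpus just went offline, the comparison
	 * above can miss the change, so force an update in that case.
	 */
	if (!cpus_updated && top_cpuset.nr_subparts_cpus)
		cpus_updated = true;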

Fixes: 4b842da276 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:32 +02:00
He Fengqing 004778bf39 bpf: Fix potential memleak and UAF in the verifier.
[ Upstream commit 75f0fc7b48ad45a2e5736bcf8de26c8872fe8695 ]

In bpf_patch_insn_data(), we first use bpf_patch_insn_single() to
insert new instructions, then use adjust_insn_aux_data() to adjust
insn_aux_data. If the old env->prog does not have enough room for the newly
inserted instructions, we use bpf_prog_realloc to construct new_prog and
free the old env->prog.

There are two errors here. First, if adjust_insn_aux_data() returns
ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
returns ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
been freed in bpf_prog_realloc, but we will still use it in bpf_check().

So in this patch, we make adjust_insn_aux_data() never fail. In
bpf_patch_insn_data(), we first pre-allocate memory for the new
insn_aux_data, then call bpf_patch_insn_single() to insert new
instructions, and at last call adjust_insn_aux_data() to adjust
insn_aux_data.
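
A condensed sketch of the reordered bpf_patch_insn_data() (error paths
trimmed):

	new_data = vzalloc(array_size(env->prog->len + len - 1, sizeof(*new_data)));
	if (!new_data)
		return NULL;

	new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
	if (!new_prog) {
		vfree(new_data);
		return NULL;
	}

	adjust_insn_aux_data(env, new_data, new_prog, off, len); /* cannot fail now */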

Fixes: 8041902dae ("bpf: adjust insn_aux_data when patching insns")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:30 +02:00
Zhen Lei 57c8e2ea47 genirq/timings: Fix error return code in irq_timings_test_irqs()
[ Upstream commit 290fdc4b7ef14e33d0e30058042b0e9bfd02b89b ]

Return a negative error code from the error handling case instead of 0, as
done elsewhere in this function.
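
The shape of the fix is the usual one for this class of bug (hypothetical
hunk for illustration; the actual allocation being checked differs):

	if (!buf) {
		ret = -ENOMEM;	/* previously fell through with ret still 0 */
		goto out;
	}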

Fixes: f52da98d90 ("genirq/timings: Add selftest for irqs circular buffer")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:29 +02:00
Quentin Perret 449884aeb3 sched: Fix UCLAMP_FLAG_IDLE setting
[ Upstream commit ca4984a7dd863f3e1c0df775ae3e744bff24c303 ]

The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.

However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,inc}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.

Fix this by clearing the flag in update_uclamp_active() as well.
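
A sketch of the idea in update_uclamp_active() (simplified; helper names as
used in the description above):

	uclamp_rq_dec_id(rq, p, clamp_id);
	uclamp_rq_inc_id(rq, p, clamp_id);

	/* an active uclamp task is back on the rq, so drop the idle marker */
	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;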

Fixes: e496187da7 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:28 +02:00
Thomas Gleixner cc608af36e hrtimer: Ensure timerfd notification for HIGHRES=n
[ Upstream commit 8c3b5e6ec0fee18bc2ce38d1dfe913413205f908 ]

If high resolution timers are disabled, the timerfd notification about a
clock-was-set event does not happen for all cases which use
clock_was_set_delayed(), because that's a NOP for HIGHRES=n, which is wrong.

Make clock_was_set_delayed() unconditionally available to fix that.
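
A sketch of the result: the HIGHRES=n inline stub goes away and the real
implementation is built regardless of CONFIG_HIGH_RES_TIMERS:

	/* kernel/time/hrtimer.c, no longer under CONFIG_HIGH_RES_TIMERS */
	void clock_was_set_delayed(void)
	{
		schedule_work(&hrtimer_work);
	}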

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Thomas Gleixner a845787830 hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
[ Upstream commit 627ef5ae2df8eeccb20d5af0e4cfa4df9e61ed28 ]

If __hrtimer_start_range_ns() is invoked with an already armed hrtimer, then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer, then on removal the clockevent device is reprogrammed
to the next expiring timer so that the pending expiry does not fire needlessly.

If the new expiry time ends up being the first expiry again, then the clock
event device has to be reprogrammed again.

Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Frederic Weisbecker c322a963d5 posix-cpu-timers: Force next expiration recalc after itimer reset
[ Upstream commit 406dd42bd1ba0c01babf9cde169bb319e52f6147 ]

When an itimer deactivates a previously armed expiration, it simply doesn't
do anything. As a result the process wide cputime counter keeps running and
the tick dependency stays set until it reaches the old ghost expiration
value.

This can be reproduced with the following snippet:

	void trigger_process_counter(void)
	{
		struct itimerval n = {};

		n.it_value.tv_sec = 100;
		setitimer(ITIMER_VIRTUAL, &n, NULL);
		n.it_value.tv_sec = 0;
		setitimer(ITIMER_VIRTUAL, &n, NULL);
	}

Fix this with resetting the relevant base expiration. This is similar to
disarming a timer.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Sergey Senozhatsky 28996dbb8a rcu/tree: Handle VM stoppage in stall detection
[ Upstream commit ccfc9dd6914feaa9a81f10f9cce56eb0f7712264 ]

The soft watchdog timer function checks whether a virtual machine
was suspended, in which case what looks like a lockup is in fact
a false positive.

This is what kvm_check_and_clear_guest_paused() does: it
tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
and if it's set then we need to touch all watchdogs and bail
out.

Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
check works fine.

There is, however, one more watchdog that runs from IRQ, so the
watchdog timer fn races with it, and that watchdog is not aware
of PVCLOCK_GUEST_STOPPED: the RCU stall detector.

apic_timer_interrupt()
 smp_apic_timer_interrupt()
  hrtimer_interrupt()
   __hrtimer_run_queues()
    tick_sched_timer()
     tick_sched_handle()
      update_process_times()
       rcu_sched_clock_irq()

This triggers RCU stalls on our devices during VM resume.

If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused(),
then there is nothing on this VCPU that touches watchdogs, and
RCU reads a stale gp stall timestamp and the new jiffies value,
which makes it think that RCU has stalled.

Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when we resume the VM.
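
A sketch of the idea in the RCU stall-check path (placement simplified):

	static void check_cpu_stall(struct rcu_data *rdp)
	{
		[...]
		/*
		 * If the VM was stopped by the host, the apparent stall is a
		 * false positive; bail out like the soft watchdog does.
		 */
		if (kvm_check_and_clear_guest_paused())
			return;
		[...]
	}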

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Dietmar Eggemann b7c560ae51 sched/deadline: Fix missing clock update in migrate_task_rq_dl()
[ Upstream commit b4da13aa28d4fd0071247b7b41c579ee8a86c81a ]

A missing clock update is causing the following warning:

rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
sub_running_bw.isra.0+0x190/0x1a0
...
CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
Hardware name: WIWYNN Mt.Jade Server System
B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
1.06.20210526) 2021/05/26
...
Call trace:
  sub_running_bw.isra.0+0x190/0x1a0
  migrate_task_rq_dl+0xf8/0x1e0
  set_task_cpu+0xa8/0x1f0
  try_to_wake_up+0x150/0x3d4
  wake_up_q+0x64/0xc0
  __up_write+0xd0/0x1c0
  up_write+0x4c/0x2b0
  cppc_set_perf+0x120/0x2d0
  cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
  __cpufreq_driver_target+0x74/0x140
  sugov_work+0x64/0x80
  kthread_worker_fn+0xe0/0x230
  kthread+0x138/0x140
  ret_from_fork+0x10/0x18

The task causing this is the `cppc_fie` DL task introduced by
commit 1eb5dde674f5 ("cpufreq: CPPC: Add support for frequency
invariance").

With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on
slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm
Server):

DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
is in `non_contending` state, migrate_task_rq_dl() calls

  sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
  rq_clock()->assert_clock_updated()

on p.

Fix this by updating the clock for a non_contending task in
migrate_task_rq_dl() before calling sub_running_bw().
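
A sketch of the fix in migrate_task_rq_dl() (surrounding locking elided):

	if (p->dl.dl_non_contending) {
		update_rq_clock(rq);		/* added: satisfies the assert */
		sub_running_bw(&p->dl, &rq->dl);
		[...]
	}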

Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Quentin Perret bba2b82d1b sched/deadline: Fix reset_on_fork reporting of DL tasks
[ Upstream commit f95091536f78971b269ec321b057b8d630b0ad8a ]

It is possible for sched_getattr() to incorrectly report the state of
the reset_on_fork flag when called on a deadline task.

Indeed, if the flag was set on a deadline task using sched_setattr()
with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
p->sched_reset_on_fork will be set, but __setscheduler() will bail out
early, which means that the dl_se->flags will not get updated by
__setscheduler_params()->__setparam_dl(). Consequently, if
sched_getattr() is then called on the task, __getparam_dl() will
override kattr.sched_flags with the now out-of-date copy in dl_se->flags
and report the stale value to userspace.

To fix this, make sure to only copy the flags that are relevant to
sched_deadline to and from the dl_se->flags field.
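
A sketch of the masking (the mask name and its exact contents are assumed
here for illustration):

	#define SCHED_DL_FLAGS	(SCHED_FLAG_RECLAIM | SCHED_FLAG_DL_OVERRUN | SCHED_FLAG_SUGOV)

	/* __setparam_dl(): only DL-specific flags are stored */
	dl_se->flags = attr->sched_flags & SCHED_DL_FLAGS;

	/* __getparam_dl(): only DL-specific flags are reported back */
	attr->sched_flags &= ~SCHED_DL_FLAGS;
	attr->sched_flags |= dl_se->flags;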

Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Peter Zijlstra a5e42516a6 locking/mutex: Fix HANDOFF condition
[ Upstream commit 048661a1f963e9517630f080687d48af79ed784c ]

Yanfei reported that setting HANDOFF should not depend on recomputing
@first, only on @first state. Which would then give:

  if (ww_ctx || !first)
    first = __mutex_waiter_is_first(lock, &waiter);
  if (first)
    __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);

But because 'ww_ctx || !first' is basically 'always' and the test for
first is relatively cheap, omit that first branch entirely.
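
With that, the resulting shape is simply (a sketch):

	first = __mutex_waiter_is_first(lock, &waiter);
	if (first)
		__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);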

Reported-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yanfei Xu <yanfei.xu@windriver.com>
Link: https://lore.kernel.org/r/20210630154114.896786297@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:25 +02:00
Andrey Zhizhikin e4dba8435a This is the 5.4.145 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmE9pNwACgkQONu9yGCS
 aT5g2Q//ZEiIQBvw6uZEA3z1Y6tuKFyIxxwOu24EbCTvB1oXXQX/XXQQPWori1Ny
 OpjuQwXJr2LW+/wEKvUEj8mTrpFD+LsZmXRLBHCw9EqqD5RURDqUZt+xh+4xtV/S
 2EgF0nCO9+84wo628Lc1C2LBJZZEo/kD7LnGeln+BXwRS1FQvGfD+5KIpOR2YzqI
 hrCtVfO5ZQpv59PrAQkfwnfITk9BM1cwA7LCD75WEN59405ZV3mzFyTdvz8s0iGt
 akxmwajLNGjQ/ro567tjpsWiK7EF26mNRTMZqu1jK6h/KjU9sQ4DzCqB+p5TPh/9
 mj/Rzq1lSjLodsR0OznKBqFIVaqXyTU+0cMItjos9MBsG/4GOj8ixbXdFRG99WmK
 bNsYucotSrE9ApYwsmqYaNcHcGeLUIsYCDFCQp3++oeF59+FA7Pp7B4bI/zcYRwY
 aqbfTkMzo8/e4OF0B2LCx+8r0xol1SoLwBfcP3hb7rlKp9OkSYsKrJ/29CUuINe1
 YC5HdrPf2HP36jlVCll5rQa+ERaxtNSCozgwxHG/x2yeOmiVqxdE+vUUmyRidah8
 DvYklCM7upUDi1ujbOwbor9R1jQSXkWMFK76EBB3GJPgguFNyczFXm8xBzfRLQvw
 H6YjIfnxNt+DLPn5uXIEhU7ISTkUno9i1BEd2NoeT1UiYTlk2bw=
 =/lic
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmE+VUgACgkQ7G51OISz
 Hs0D+A//c+QInKUmZ5jcHXXqaiD0JYt5+arGOIBAgIjRcWDL3yJpuuP5nwZRnuRf
 N0SyJjZIpb/WyS27zEaHwOg7xKkhCqagix/J9qdNDP94W0xoGKdXsxotnixXXjMo
 tY9RUqwA45i412x6RA4sXHBmyhgPwSuoLnHk+qVF4xOBD2rshVXzv6zQAO51Lg4J
 3/a2agXlrreCGtHmnKwn47pW+mkHEaquCmcXaz/6PNbWDHwZpmmtCRlsvMW1PyMe
 7TZ/mQ3fofSX3bxuzzYWmddcGTNtwe/IsmEFZmhC3Vl4BGtqE2YKj3439zca1572
 6ttLDy9K3pAg/rf0NhIe7nIlogmJyunP6U8CVnuAwH0esixWxyXSSXUKTOXhKYFG
 jgv/+adhqjSZQhm1p1HY8KItsJE8o5m87b2UKMe1UKGExAJ66qJsgZtuufbEDA5r
 S+qnQEzE4lfaYmCH6WBtn6a1fodONw/KmsJwk5UhNlv2BVS8LQpcpADVRmq/mCDj
 FpCe7uFJVaDnxpUc1Trv6D6YJpL6n1KIUQcVyExtmBtpuKK/KTLs+NP5hpUZ9xtV
 ftAEiZkplJP+qACdCeZt5QukiHBIfds32sUZ4PDp3ANbMmY6UpTWGQyZq+7PNL4d
 +OzCCDzlXRdZioua6z8oTaAkabo2pcK5ao7GXSor35mU6JZvMVg=
 =Z0Kz
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.145' into 5.4-2.3.x-imx

This is the 5.4.145 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-12 19:30:13 +00:00
Peter Zijlstra 56c77c1b52 kthread: Fix PF_KTHREAD vs to_kthread() race
commit 3a7956e25e1d7b3c148569e78895e1f3178122a9 upstream.

The kthread_is_per_cpu() construct relies on only being called on
PF_KTHREAD tasks (per the WARN in to_kthread). This gives rise to the
following usage pattern:

	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))

However, as reported by syzkaller, this is broken. The scenario is:

	CPU0				CPU1 (running p)

	(p->flags & PF_KTHREAD) // true

					begin_new_exec()
					  me->flags &= ~(PF_KTHREAD|...);
	kthread_is_per_cpu(p)
	  to_kthread(p)
	    WARN(!(p->flags & PF_KTHREAD) <-- *SPLAT*

Introduce __to_kthread() that omits the WARN and is sure to check both
values.
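
A sketch of what such a helper looks like (assuming the kthread struct is
stashed in set_child_tid, as to_kthread() does; illustrative, not the
verbatim patch):

  static inline struct kthread *__to_kthread(struct task_struct *p)
  {
          struct kthread *kthread = (__force void *)p->set_child_tid;

          /* no WARN: silently return NULL if @p is not (or no longer) a kthread */
          if (kthread && !(p->flags & PF_KTHREAD))
                  kthread = NULL;
          return kthread;
  }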

Use this to remove the problematic pattern for kthread_is_per_cpu()
and fix a number of other kthread_*() functions that have similar
issues but are currently not used in ways that would expose the
problem.

Notably kthread_func() is only ever called on 'current', while
kthread_probe_data() is only used for PF_WQ_WORKER, which implies the
task is from kthread_create*().

Fixes: ac687e6e8c26 ("kthread: Extract KTHREAD_IS_PER_CPU")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lkml.kernel.org/r/YH6WJc825C4P0FCK@hirez.programming.kicks-ass.net
Signed-off-by: Patrick Schaaf <bof@bof.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-12 08:56:39 +02:00
Andrey Zhizhikin 79c30f58eb This is the 5.4.144 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEx2AAACgkQONu9yGCS
 aT7csg//ZhXXfRkPNMhpkkMjcV7F825mLAPs1vsluIEIZ0oInOpegu8SyDENOfui
 HyFLZ/2Stewa0mn7kNS1caAUXLpFvZ087sIz/SipzupFjLTUHFsNcMYrd19R1M4h
 UK/owAJeoq/pgR4kUck4o/r+47lo8CMqkscbEdKSvwxYUeANIcbGVB5Sf2UaJr5S
 lqBZeliWY/jYGvLWBoSc7mvUwWRbkKLnQu2JkfvGKM4ODOzpbh8TUhq8NxEL7ZFn
 mZxtNmWPvG2PHHvNP89pwKnKQx70ySKrlQdDv10gL6nIHhKuqwLxBo28Q+KcKMYr
 vfoOFS5Vk35jA7Xt8LhNF+lQtDTbN+2YLeDtoAq+aWMmEW/RUYXSU/3thh+WFuO5
 uZZAbrh4r3bew+PLFpEtnVjxkpMsU9EC33KuIZXIGlDEkFlEneJ9pMQYH7XIwQnV
 5sSSOnbyzkajxv9Kpu6XEg3kKyJf+gk/AB/psgfMR0v/jQ4PXVk9+cZDZxKFcxjj
 wGywDkgIb+/sPrABWici/yXjIup0OSG1fK9/Ki9uLgNzxXZ0h+4e3DcXNMxs1B/p
 GpBPP773qIff2lEDhAI+SbP8pHj5Mnc1j77WUQTU9vsIJcftYm4i0G+POpXnynzx
 gzbjJjOhTBL57OciLQlmL2s5ZZUPgPvu5VoHsRfwOu/bbarRADE=
 =RA6W
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEx8s8ACgkQ7G51OISz
 Hs1JZA/8Cj7g56QymgMuXHEB1PecU7pLpO5egRK3X6xHxJwksD7Xp2LfpaRxjzGw
 XNQsp+4mbJX4oHiZPjD/RsFOdVuNU3ff3mliSmoH2Tdepa2TuKFt7T8V3GE7FN6K
 ns52rvIzbhF762nL1Vs+LE0YBq1w6rTvL7eenNxMo9pwUxJv95X91v7BpRQjTAY5
 /ngvj8tRKN10dSokwrCpzk47Sj/jhSoLlckJL7+iOopQdhOo/HTfWj1aPCaZC/AX
 q2EUg/L2GB1Ij342lDNEZSWn2xAvuAT6+45R8p3GxyG6TMihwiKGXQM922MJDZAV
 T3Chxgu//OlB/spPMAuFgfBNqaX1z+zxv3Dc1EvEbSNPhn6PwEZ2ck9hYkuPmvI3
 78dkyqj3x3AR5VKvc/CpnqSokXBjV7B1TOxJlHKvJ77lvWuDwujir+chmULjahA8
 bVPpbBC9BfF/nX0cYsjQuDNyddqTpt3cv1Cp9w5gXhs/Nj5MsNDRyZxaVHlGaI/W
 h3N6rAU2cNDDtI4Zqr8Lo5IgBLMVUPuj9ZUNUJBKq3YX5CooEmjCtBKZch55Ou5h
 6xmcaMgrFre3FHKfvVhJ5ACK/DoPuWLvr4Af4Q0v6kgif81Is4LQQ+EEXRLMryi2
 fTv2X5r2GLoMHlUH0WgBW8pY0NuoCJiZuxCY5T1c61pCbDCpRss=
 =yagG
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.144' into 5.4-2.3.x-imx

This is the 5.4.144 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-09-03 10:02:52 +00:00
Richard Guy Briggs 0634c0f919 audit: move put_tree() to avoid trim_trees refcount underflow and UAF
commit 67d69e9d1a6c889d98951c1d74b19332ce0565af upstream.

AUDIT_TRIM is expected to be idempotent, but multiple executions resulted
in a refcount underflow and use-after-free.

git bisect fingered commit fb041bb7c0a9 ("locking/refcount: Consolidate
implementations of refcount_t"), but that patch, with its more thorough
checking that was absent from the x86 assembly code, merely exposed a
pre-existing tree refcount imbalance in the tree-trimming code that had
been refactored with prune_one() to remove a tree in
commit 8432c70062 ("audit: Simplify locking around untag_chunk()").

Move the put_tree() to cover only the prune_one() case.

Passes audit-testsuite and 3 passes of "auditctl -t" with at least one
directory watch.

Cc: Jan Kara <jack@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Seiji Nishikawa <snishika@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 8432c70062 ("audit: Simplify locking around untag_chunk()")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
[PM: reformatted/cleaned-up the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03 10:08:16 +02:00
Andrii Nakryiko 38adbf21f3 bpf: Fix cast to pointer from integer of different size warning
commit 2dedd7d2165565bafa89718eaadfc5d1a7865f66 upstream.

Fix "warning: cast to pointer from integer of different size" when
casting u64 addr to void *.
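
The usual idiom for silencing this class of warning is to narrow through a
cast to long first, e.g. (illustrative fragment, not the verbatim patch):

  void *ptr = (void *)(long)addr;   /* cast via long so the width matches the pointer */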

Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191011172053.2980619-1-andriin@fb.com
Cc: Rafael David Tinoco <rafaeldtinoco@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03 10:08:16 +02:00
Andrii Nakryiko 812ee47ad7 bpf: Track contents of read-only maps as scalars
commit a23740ec43ba022dbfd139d0fe3eff193216272b upstream.

Maps that are read-only both from BPF program side and user space side
have their contents constant, so verifier can track referenced values
precisely and use that knowledge for dead code elimination, branch
pruning, etc. This patch teaches BPF verifier how to do this.
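
As an illustration only (assuming a libbpf-style loader that places
'const volatile' globals into a read-only, frozen .rodata map), the verifier
can then prune branches such as:

  const volatile int debug = 0;    /* lives in a read-only, frozen map */

  SEC("tracepoint/syscalls/sys_enter_read")
  int handle(void *ctx)
  {
          if (debug)               /* known to be 0 at verification time */
                  bpf_printk("read\n");
          return 0;
  }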

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191009201458.2679171-2-andriin@fb.com
Signed-off-by: Rafael David Tinoco <rafaeldtinoco@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03 10:08:15 +02:00
Andrey Zhizhikin 49ef8ef4cd Linux 5.4.143
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmEnj2AACgkQ3qZv95d3
 LNxNEQ//auFOSmgsMtI8LDmKlP/f22+FmICk8+IHeBMRBMDY0WGEEdsRZgcf4R7M
 hgyBn8ISmU5W0idpoxzVTiNxDJ0YVbVSIX12lZO6OHnwcv6hNW7iOW5TaGjd8EN+
 fkh8MtAToBQrp4fFb1QkC11pYNMPiuvDNB2nW+F3ixfYLyC1EF4g2/qVUKy7s6rZ
 dbqDfuI3Q7R2opsIkpmsV7ClKGbJzsP7oo0H5EOQMpmOowhg3oJy8oYqMMTgij1T
 bJU8kujNElsK+/nbpVzJPrpprQH9eGP+hB5ZAv6s/FuJ6RmkoAczYQnX3HL6TfCS
 ymoyJ01gsmDic9RnG6qei5LkCwf5Td2SKjRZdqGWKTluWD1ZAwzUX8Ww6K+t5uWk
 PQPyCfU2wk2D3JjJWt0vTxl/GZGAkYbZpy5ISZFJhK7/j9/oTSrPWra7/BRu4K2I
 2PK7XGjNyQxSguQqmG064Q0nYEOU03pR2H8tyG3iH0nBBd9p54D0Bg0D73I2h0az
 PoGhBo71m9SYCPP1zSXl+xLFyWGZZDUYCaU9KPlwkYCCcRUSQbfCKwrYHfEcHZgL
 4QtYlpUi+/C0Ga7gAK9ierqCKSNTOpoVna618j97uqCYVIU8estLBqX4mMAQquVF
 R8+cy6L/aTBVw4Zwd0Jmt85GwBHlHahUGEq87+Qpw/laqjkBFcg=
 =SIVq
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEorqsACgkQ7G51OISz
 Hs3MNQ/9GOK/4DrXYH4RTeGJG2+tTk8VMOrmoDxQOkGXggLvpw06gP1Q9Dz8qzee
 hkILNVr4xC1qxzBzCIAOwzSM0DOQeGJrjtlDPOHxFB6akmwfZ9mU7W4k3YaBHz2c
 +pp4YM0YalmVoJSDBVyFrN67q8gorK39yPgi3BC2tUAz9OSJfPsmdKwqe8ICe7Lp
 hq7R8VfeZdcmYW2vRF5v2yOlzg9vlEd2JfXVL+LJ3R9Eo2Ytlam3gaeObJJqtjDw
 MObvvSLeG2QjZ38tvrWjudfR1Z0hDFy/E1E1AI4y9STLHHQj0eM+3dzEP1mugVVk
 bvLeas2Raf8IA0tJfNQIgz54DcrCT7vHGblKkqESg6kJ/fVc+7CxaLf72DdU0KrN
 0IzRHu3TX8LCj1BQax6eDWlysa3k837WkRMLK55NKPhqkDCb3IhpI3YB+rVgQUoa
 DpjmZ+sTDPypasZf2rGTdGosZBccsRdNbohmOuR202rKOeMnndGJRQrAwdYym/WO
 C9pKosNwBCFk4I97chihvamN/vH0MEJkICwR858I6+uCAMPISNqYcSbRYC3YXoyy
 VU0JBrp4hM5IlKvxKoRuT/Pplt54xhoWREXOS1ua+BOhcWox85QYt+d7uLxt02iZ
 4u9IEXwy0QM+qxZaWw+9+IP4DfOgVa3BAsrVzRfAoHrbwMte7xY=
 =W2Mz
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.143' into 5.4-2.3.x-imx

Linux 5.4.143

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-27 09:21:44 +00:00
Steven Rostedt (VMware) 20c2f141b1 tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name
[ Upstream commit 5acce0bff2a0420ce87d4591daeb867f47d552c2 ]

The following commands:

 # echo 'read_max u64 size;' > synthetic_events
 # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger

Causes:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:strcmp+0xc/0x20
 Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
 RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8
 R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210
 FS:  00007f1fa6785740(0000) GS:ffff9714da400000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0
 Call Trace:
  __find_event_file+0x4e/0x80
  action_create+0x6b7/0xeb0
  ? kstrdup+0x44/0x60
  event_hist_trigger_func+0x1a07/0x2130
  trigger_process_regex+0xbd/0x110
  event_trigger_write+0x71/0xd0
  vfs_write+0xe9/0x310
  ksys_write+0x68/0xe0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f1fa6879e87

The problem was the "trace(read_max,count)" where the "count" should be
"$count" as "onmax()" only handles variables (although it really should be
able to figure out that "count" is a field of sys_enter_read). But there's
a path that does not find the variable and ends up passing a NULL for the
event name, which is then handed straight to "strcmp()".
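
For reference, the trigger presumably intended here (passing the variable,
not the raw field name) would be:

 # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,$count)' > events/syscalls/sys_enter_read/trigger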

Add a check for NULL to return an error on the command with:

 # cat error_log
  hist:syscalls:sys_enter_read: error: Couldn't create or find variable
  Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)
                                ^
Link: https://lkml.kernel.org/r/20210808003011.4037f8d0@oasis.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 50450603ec ("tracing: Add 'onmax' hist trigger action support")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26 08:36:20 -04:00
Ilya Leoshkevich 1fe038030c bpf: Clear zext_dst of dead insns
[ Upstream commit 45c709f8c71b525b51988e782febe84ce933e7e0 ]

"access skb fields ok" verifier test fails on s390 with the "verifier
bug. zext_dst is set, but no reg is defined" message. The first insns
of the test prog are ...

   0:	61 01 00 00 00 00 00 00 	ldxw %r0,[%r1+0]
   8:	35 00 00 01 00 00 00 00 	jge %r0,0,1
  10:	61 01 00 08 00 00 00 00 	ldxw %r0,[%r1+8]

... and the 3rd one is dead (this does not look intentional to me, but
this is a separate topic).

sanitize_dead_code() converts dead insns into "ja -1", but keeps
zext_dst. When opt_subreg_zext_lo32_rnd_hi32() tries to parse such
an insn, it sees this discrepancy and bails. This problem can be seen
only with JITs whose bpf_jit_needs_zext() returns true.

Fix by clearing dead insns' zext_dst.
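
A sketch of the change in sanitize_dead_code() (illustrative, not the
verbatim patch):

  for (i = 0; i < insn_cnt; i++) {
          if (aux_data[i].seen)
                  continue;
          memcpy(insn + i, &trap, sizeof(trap));   /* turn dead insn into "ja -1" */
          aux_data[i].zext_dst = false;            /* added: drop the stale zext marker */
  }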

The commits that contributed to this problem are:

1. 5aa5bd14c5 ("bpf: add initial suite for selftests"), which
   introduced the test with the dead code.
2. 5327ed3d44 ("bpf: verifier: mark verified-insn with
   sub-register zext flag"), which introduced the zext_dst flag.
3. 83a2881903f3 ("bpf: Account for BPF_FETCH in
   insn_has_def32()"), which introduced the sanity check.
4. 9183671af6db ("bpf: Fix leakage under speculation on
   mispredicted branches"), which bisect points to.

It's best to fix this on stable branches that contain the second one,
since that's the point where the inconsistency was introduced.

Fixes: 5327ed3d44 ("bpf: verifier: mark verified-insn with sub-register zext flag")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210812151811.184086-2-iii@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26 08:36:17 -04:00
Andrey Zhizhikin eb3365561f This is the 5.4.142 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEcr3MACgkQONu9yGCS
 aT6eag//f6COc7PQMKCJU7hcw0Xe4pIPmUUj+EpkwztzfX45dCzWxbhHxRiOqtKa
 ReSUXZ8mJLzYJgyHRr6FfsUqENWzKGqHby15yZ2h0rEyJns/V054NiBjz1aWoQZ4
 axpF1SYaLfLglfobslLc/3+JbyTfxBK/+m6XnZRJXqMMFnJ+hljJJxXuCryEQnU6
 KtAlrS2ITpbEyAECAE02oErxGDGCnTDzGpQvlSeJWqJVlisrsGIvGowjFliy6ONf
 YDjsejKlNUlQwnplXErefuXf7uhT/36sN0DnxCy5yXJ8SJwnzja3eYDz1yG9apG0
 ZR7KM3dN3L8viuRx2GEOubh8EMbirErD9DrhaPyaNhEPKHI2cHxHdG2prj5WkBzZ
 OjXcW32FDWzw6/kfnHEOBl0OrmhsEIY1/pP8jegape8lDrj/szN0ViJe0rzElba1
 6pb0D/ASFPYtYwR1O2/qZiPqqzHQEAFfDyDMKEKzogbNAHUbfvaE6g3qYafwQgS6
 o+g/BBxtrGNaIWtMtQ75aeoqFA4mkE9MrLJ1SzEFpw/PvHCHtFItCyEcUwaNvEz8
 OdwceDSIkT4Bn0GzEwuxKxcFyZ1R3rIABPIUGbid8Q3w6ZgM/vr2BPR3vkUY0zl+
 g9DPae9S4K8A+kYGwyeYzZ0dPC6otb8h01RtiGJyQgyiPAeTutk=
 =QJcB
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEc1fYACgkQ7G51OISz
 Hs0LrQ//cvoxuCT2kOcr+wqobXo3v9d/xEypsLwVaKrg0k8kjLGrqDUDqo7yFm4o
 WydvPqL4xd0m7H3+PMSojwydq5o4Jx7x3+pGOt19xrdunGSj8R2eKD/4sboTWM3M
 w1p1cjladgX2rcbdpuh7wVVB6VFG49QVZJpum/nSvfkt4n9rhza4rT9/2wVcfJ1m
 QwI1F66lbGm8zQ8bAHJyflZz6QbJWbAlbEZwKM7SWU+hR5LW9wyOXdFgW6p5r2LM
 9FCnLA2K2wVuXUZcUKXXBDpctIeoUS8dc3MaXFKRMgjdCwMvwWp5GYUuGQIkSsRB
 xE25Sq1R22Jx2mnq4V9EVAUEN7KCVHQKVprrlDp1aPDP/xndlXDMXHf5oLW/12O7
 O1p2XehdxcTA/KZEETTLdiMx23Gku0NjzjKuZGTc0Op+iwF7VjHVvSOgrw+TtTMn
 05wwH472TFwnpXo9HzT1ugh8RnmeF2qD3fNIzwaQIAgDSuMG8Xdu8V34UiMK3Gpc
 /kXoAzPQgpAlzyAd2cP4IuPYs8PRNFskeZmH6rbrhwYGCZ6IhN/dJnNAD6LWKn6d
 99UxBoIPKVgi191/Gx+L6J1+Bi8Ulov8qMNVBp8Bv8RDLdGhozIxu7GzNxvklEVZ
 s4mvZ7mVgtsXGxgbQWnRkKVkJMkQ3P9j2B0b0Cjsmz+1OOw/9xw=
 =be34
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.142' into 5.4-2.3.x-imx

This is the 5.4.142 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-18 09:42:11 +00:00
Ben Dai 0c8dea3fd5 genirq/timings: Prevent potential array overflow in __irq_timings_store()
commit b9cc7d8a4656a6e815852c27ab50365009cb69c1 upstream.

When the interrupt interval is greater than 2 ^ PREDICTION_BUFFER_SIZE *
PREDICTION_FACTOR us and less than 1s, the calculated index will be greater
than the length of irqs->ema_time[]. Check the calculated index before
using it to prevent array overflow.
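
A sketch of such a bound check in __irq_timings_store() (helper name assumed
from kernel/irq/timings.c; exact placement may differ):

	index = irq_timings_interval_index(interval);

	/* an out-of-range interval would index past irqs->ema_time[] */
	if (index > PREDICTION_BUFFER_SIZE - 1) {
		irqs->count = 0;
		return;
	}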

Fixes: 23aa3b9a6b ("genirq/timings: Encapsulate storing function")
Signed-off-by: Ben Dai <ben.dai@unisoc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425150903.25456-1-ben.dai9703@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18 08:57:02 +02:00
Bixuan Cui 4dfe809271 genirq/msi: Ensure deactivation on teardown
commit dbbc93576e03fbe24b365fab0e901eb442237a8a upstream.

msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
msi_domain_free_irqs() does not enforce deactivation before tearing down
the interrupts.

This happens when PCI/MSI interrupts are set up and never used before being
torn down again, e.g. in error handling paths. The only place which cleans
that up is the error handling path in msi_domain_alloc_irqs().

Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
to cure that.
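
Conceptually, a sketch only (not the verbatim patch): the free path simply
undoes any activation left over from allocation before freeing the irqs.

	for_each_msi_entry(desc, dev) {
		struct irq_data *irqd;

		if (!desc->irq)
			continue;
		irqd = irq_domain_get_irq_data(domain, desc->irq);
		if (irqd && irqd_is_activated(irqd))
			irq_domain_deactivate_irq(irqd);
	}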

Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18 08:57:02 +02:00
Thomas Gleixner eda32c2188 genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
commit 826da771291fc25a428e871f9e7fb465e390f852 upstream.

X86 IO/APIC and MSI interrupts (when used without interrupts remapping)
require that the affinity setup on startup is done before the interrupt is
enabled for the first time as the non-remapped operation mode cannot safely
migrate enabled interrupts from arbitrary contexts. Provide a new irq chip
flag which allows affected hardware to request this.

This has to be opt-in because there have been reports in the past that some
interrupt chips cannot handle affinity setting before startup.
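
The core-side idea, roughly (a sketch; the actual hunk in irq_startup()
differs in detail):

	bool pre = desc->irq_data.chip->flags & IRQCHIP_AFFINITY_PRE_STARTUP;

	if (pre)
		irq_setup_affinity(desc);   /* apply affinity before first enable */
	ret = __irq_startup(desc);
	if (!pre)
		irq_setup_affinity(desc);   /* legacy behaviour for everyone else */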

Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.779791738@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18 08:57:02 +02:00
Andrey Zhizhikin 915e71b823 This is the 5.4.141 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEY9Z4ACgkQONu9yGCS
 aT6Zcw/9Hw/LCgboD9SnkaHWx1vWwXilz4uqTp6+qfCUZmtjWnvORA/wUzAqhcwu
 1pO3aveUVLS4FyGQbn4vdMs5Zv7ZkDHchJ+FnloLAPnIUI8xce1YxKkwT8ysvvOF
 jFf2vPOr7PqCxTlGZ3MitJPhl4mkVghlwq6rk/EwFuqG8JRUiwL2jeHVX24VywJI
 E+XKOMJwXHdB+VWmmP5yvqDcPcQwYVAB01BBMWYEHFa4WTrRwICsQz6Y1319YAFT
 Sb0m2g/Pwv3YEzdgtCCTrB8FOr2Bjum/KoAZd+v9tQuFkZSwp14XhAlCWlYBSHUE
 AWoVc9w6KYNt4cCdem7P8oZWqAnkR9pLTwlg1XWZCQCTzqIp8YfsjTMONyKf/94O
 fHVTHTh7x0f2taGhAmzo52dIRBUBA8bGQ8F/t1XUc51RWn9CEBc230ich8BQtH1t
 hm3X9eB1CM9FRg4imIf9RW5X8+gkTSYNfrnpBh7IEKXIPiiKpDLyO3MN4IuFLBzN
 fRPhQacTxCq8QgO6eiS8M5cafglLL3U7GsE/qyiVQjVeoGNnLZINamdfmDLsgfGH
 f67pTwVNKNcoT1XJf1MqgDM2IoAcZcQqLtNhedva9cC74SiHv+nf28eDWwJJ5u10
 fdNJowrZdF828BCuee6UYUsOXJ6r8I6UybLuzpZzpvBvvMiYZH0=
 =edfG
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEaTrMACgkQ7G51OISz
 Hs0cJw//SsJNcVX8LJxpcWj0NS4+HNz7tNZGJZ1MP4C1LjoMNolqJrHQccs9sbEW
 bfDSI1LnWd49Haoua+KMn5hysHzdxa9w/dhAkbNUopt1MFP7WKzpWjp93eOpydfc
 4smXDOByGyklyXeXf7xWwJo4BVXBC+OutMhXFMrTzagIIsi0K7p9ToedSVEDOIgn
 Sa22tOiDsolryAqSFHmn3mTc0ju1SPS2+wT1oxFuDzOWQJGHPXeaFl1sgIU8F7q5
 HixGaxMU9ofYI9L7pr4q8Gka3dSRfDpWGxG0sp9rjgI/AHX2sCNQbd7ueRPLJaEn
 A2AGDBVNVUK+YuTnBoHoHTK6GSR2gV656AK/EsSTPedant6id6I3PrffANWfDv0Y
 TtdxC68r2DQisUiOqFn7gCt0td456P+pASfJNuYy6yISA1Wsy1zjBULjc+YPqfVd
 IOIJV9PtE3VqRfdRsZtXA5G7QBzlTGtDs9EKPg3EzSgAn50aDt3cYrIx7XwoNe/7
 TkMQED/PR3Msr5qU60abkeqsMMaM37cV8ZFerOB4we/6MmD6aPbThzLza7JEvvjg
 IP4GlgcxzmBH6aFLJlGZa2Vw6ge4NzUgiRRjrNz3lZN+6R1zdZKj5ECEVtOrLpdV
 L1XLfT58wsKi3CFWfGWu4I3B6FONwaolg1D7tALH2fmK6D6q+TU=
 =dQtZ
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.141' into 5.4-2.3.x-imx

This is the 5.4.141 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-16 11:40:33 +00:00
Andrey Zhizhikin 49dc55b9cb This is the 5.4.140 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEVBCYACgkQONu9yGCS
 aT53WxAAqljdZCHORMxU9rnAHSGNHMtGH3UA7TXDU3SKOYSDRW4FOxI3XUJzJLeW
 jWB/ZXRSeNmSpwFVmUNYhMkHP3VTXDp73xx2y8DI8U20ykiTeyO6Ed+zW8GluWBP
 uvvdtjV511wspCUiGKOnD88z9FKvfb5OQKxRb03XrwxQqo3JvWSB5QZhWaBP0UnW
 j6YWAQm/luvsjx0V4sW36mDj3FWihtlyFyh4Psa7yOdlu6whgLZdGMeSCqsGAcGx
 6SdshcXrMpJqU9op70a2WHbo8YYaEyLZ4bOK5FmXPfKokh7HmqHEXi7HuW2UcDmr
 hi3bR455LqQchw3a7OtiGaEF4liUnJw+EIQx1kaA330EvjlIUwayxdyTitZ/z+5c
 x9i3NS6bLFUL0FPl79tM5oyd7cR4ZSyrqIAVmE8Z+npCuk3XcKWgxfTvuPemgoBk
 89Lbpe+C/zWBkStZFmK8OHAv9iBhP/jR2TmRtRhgHJQkV5qCiXCHejb3g8jur99F
 q4a9AmvN2ignkejh0darNXk2VdfTBfWIVrXjhcncsHSHGcV4xbc1uDyqQad0aug5
 iRtmvkmYG0SruHFi3mF9KhKP1IjD0vI2uah6GeX0FLb8zQIuddNpkXSZMS/MZV0c
 pZicz6qB4JYT3AiiFEmfDtt1FGMwf1weZBmrfHE1OH1FWiZYC/w=
 =5ku+
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEaTq4ACgkQ7G51OISz
 Hs2sKg//UmCY1/aMvA+3Tq2VmyyYN9Rp0NZdocQWTpw3yMEIla7JpxSqWWQi9/6U
 cawfBRYwoY1OnpQL/heyAptuV7/kZdaJMEpFd//DvdnDabnaxKMnTnRkyh+VdIw0
 vC9Bk/oHDK+ZTcNhbBqZVscmOJ3ox20t/ST/u4SeAq8dYew78AfAV4D1GjfN48Id
 18qDzCg+TX9CXxXUGTyX4V9G+MnBnfjeUcb1U2bsHqQ8uUCLtFVm5zc42u6GrD3x
 VDnh2WTnnhryc/fefitjUILVKvYRfDVTagERRKB9VldlXBVz0LxXcnmGfRMkeFR9
 zuL/9j4lOtCWaSoqpkXUpvpYgW35TJN+4EVeO8sUCqztzCyNAW9M4Qrf0OvC5aTE
 pi/v8b6BzuqJczPMBggk2SdetCqYvgJbeMS2nBZsgkAZk1zplUOEosSHWtToGFxo
 g2rPnHlxhBabTuQAXSQeV5wHs7h+cUhd7TSpWWpcEGRLP4qwXEfgw0ktLkGLxg1q
 9xQc/utISWlbv1bjqNPjbc8Vi3nX20PqWTVc+o3QFRGRU/9xqCYoKJ5wdZWe/8zR
 mRw55Rz460m8W28IFHiGFDpNB236wAcqgisiaEsHYGkpS1WYvaIdKXDwckCXFE2C
 6xbaMkfWU0z20MfbBxuhlv+Pipv+jrD3qtDQb60y57GP6BiRbOM=
 =3HAp
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.140' into 5.4-2.3.x-imx

This is the 5.4.140 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-16 11:40:28 +00:00
Andrey Zhizhikin bbdb668ff5 This is the 5.4.139 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmEPgf0ACgkQONu9yGCS
 aT6RQA//cKA0KmfDEwpNktHuGWMnhbjuf+WSsjqqoKRYCCdbBhc/HMTL05Xjbvpg
 VCrYchavp8lwvSd8d0cFMA4jcE1zjut+JzG08W1aIV0DJDflbCLlM8jzl/3Ft6c8
 CTWHRNEyBUw1ynaUVV/L+Vlox9GTk4SYY92pXX6Ciar0sJHLeXDw9VK/NUQG51d7
 ctfvro0D8JM0+HHG/CZM+wkmpMW5nUNCnBubsb3fp5Tpi2rMCyxVVyj+NwT+mYO5
 jCOl/DTMJBLFBqG53cwP/sEqTvLrqhCF3ZRPBi5hmLm5+NfvWz3Orlalfn0nFU0n
 n+7fKUH/LghuduXnxSMwAtbZUhP6rGqDwOnMJtqEiGJQloNC1f/ER1VNFvOG/bm0
 +SQBB6iR56Z+cnqKpyz41JdOsUk4Y2dDRA5bh1h5bw4ctfXDBgQ/OqXWHIboLlQg
 7BNlq1tQoUSu8IHhJtZJLtpdSLs6jtZ4nPtAeMjLDElYJIKtzCKhSkGnyWA8V5i/
 V07zDlYBFryyvBJcJEgNHLaZt7wh0MEDYinOlnxOzapG8JYabItmioFABGzOCXu4
 2QXCWEuIdMk+J79yQIGGUNSRKWwTyPoxBRbkbAHU0hXHI6R9V6V0/3Rp8hlcoPZd
 MSU77GD306j/+ekM04gNZrI0ploywEbqxcDoM2XBSXcZTrFxtdg=
 =FS2D
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEdQaENiSDAlGTDEbB7G51OISzHs0FAmEaTqUACgkQ7G51OISz
 Hs3zOA//REX8dd54cRbpgOAS+yYDcRuy+SfUTpKb+BFFWBEF599fezX7NKV+0ubB
 mvp3//bpd35NsmSV5rtD0+CcD4NYbLu2fO5zG+50y0HcxKRrgveifquitemjGKhg
 nMITs9F5ZvDCteMciW6zb6xZNfM6ehEpIGWtMuACgOm89AubGfrk9ZrCd4Wk/Uaw
 6pvPpDjzElOfJ8un7F1vwmcbbY/ApyvDdzYnZf6fKzNNfx4dkDt2uEjlLUlw0Wee
 Xahd610tfG8YYTCecbEvqtgy9RSy3TI19sMKM86GD3IJCo+LVmJ8g475A8FkglCF
 8xhogK8Px2LqxwVOX4wtm/o7hP6wtzwjScapAVN6TPx2t++Ab7WzpMJEpiy4gGFh
 u7BVoS9SbVjQU4tlonXEncGm1Bj3qw0UmdW3H9VkCnlIQUYMR88b3KeZhR4cZUDv
 SPhvY5JEn94F80B6Bbm926eOBeRAwtqRezW5er3kzCZA3m0RlOMicwf2UsLlQd2a
 cifoD3af5d3KJZzMVUX3uO8G4ArT7qJBtS4CeKA1U7TUbrPT9DecGILKdV61/E2L
 +Fg05QYe+Xyh0cI/K6nsdrnLkVCFq7uAT/6TyW6RNl4SJHf0Wli9qsdkDJIN6QQw
 q2hy8GOJvDk3sFr7W3C51uFnuBHU0uYC2cQp5M3e4zG1UCfRADg=
 =Aw5O
 -----END PGP SIGNATURE-----

Merge tag 'v5.4.139' into 5.4-2.3.x-imx

This is the 5.4.139 stable release

Signed-off-by: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
2021-08-16 11:40:18 +00:00
Masami Hiramatsu 396f29ea0c tracing: Reject string operand in the histogram expression
commit a9d10ca4986571bffc19778742d508cc8dd13e02 upstream.

Since the string type can not be the target of the addition / subtraction
operation, it must be rejected. Without this fix, the string type is
silently converted to digits.
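
A sketch of the check (illustrative; HIST_FIELD_FL_STRING is the existing
marker for string fields, error reporting elided):

	if (operand1->flags & HIST_FIELD_FL_STRING) {
		ret = -EINVAL;     /* strings cannot be used in +/- expressions */
		goto free;
	}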

Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-15 13:08:02 +02:00
Thomas Gleixner 42ac2c6348 timers: Move clearing of base::timer_running under base:: Lock
commit bb7262b295472eb6858b5c49893954794027cd84 upstream.

syzbot reported KCSAN data races vs. timer_base::timer_running being set to
NULL without holding base::lock in expire_timers().

This looks innocent and most reads are clearly not problematic, but
Frederic identified an issue which is:

 int data = 0;

 void timer_func(struct timer_list *t)
 {
    data = 1;
 }

 CPU 0                                            CPU 1
 ------------------------------                   --------------------------
 base = lock_timer_base(timer, &flags);           raw_spin_unlock(&base->lock);
 if (base->running_timer != timer)                call_timer_fn(timer, fn, baseclk);
   ret = detach_if_pending(timer, base, true);    base->running_timer = NULL;
 raw_spin_unlock_irqrestore(&base->lock, flags);  raw_spin_lock(&base->lock);

 x = data;

If the timer has previously executed on CPU 1, then CPU 0 can observe
base->running_timer == NULL and return, assuming the timer has completed,
but that is not guaranteed on all architectures. The comment for
del_timer_sync() makes that guarantee. Moving the assignment under
base->lock prevents this.

For a non-RT kernel it is, performance-wise, completely irrelevant whether
the store happens before or after taking the lock. For an RT kernel, moving
the store under the lock requires an extra unlock/lock pair in the case that
there is a waiter for the timer, but that's not the end of the world.
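
After the change, expire_timers() ends up with roughly this shape (a sketch,
not the verbatim patch):

 	if (timer->flags & TIMER_IRQSAFE) {
 		raw_spin_unlock(&base->lock);
 		call_timer_fn(timer, fn, baseclk);
 		raw_spin_lock(&base->lock);
 		base->running_timer = NULL;     /* now under base->lock */
 	} else {
 		raw_spin_unlock_irq(&base->lock);
 		call_timer_fn(timer, fn, baseclk);
 		raw_spin_lock_irq(&base->lock);
 		base->running_timer = NULL;     /* now under base->lock */
 	}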

Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com
Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com
Fixes: 030dcdd197 ("timers: Prepare support for PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-12 13:21:03 +02:00
Steven Rostedt (VMware) 7da261e6bb tracing / histogram: Give calculation hist_fields a size
commit 2c05caa7ba8803209769b9e4fe02c38d77ae88d0 upstream.

When working on my user space applications, I found a bug in the synthetic
event code where the automated synthetic event field was not matching the
event field calculation it was attached to. Looking deeper into it, it was
because the calculation hist_field was not given a size.

The synthetic event fields are matched to their hist_fields either by
having the field have an identical string type, or if that does not match,
then the size and signed values are used to match the fields.

The problem arose when I tried to match a calculation where the fields
were "unsigned int". My tool created a synthetic event of type "u32". But
it failed to match. The string was:

  diff=field1-field2:onmatch(event).trace(synth,$diff)

Adding debugging into the kernel, I found that the size of "diff" was 0.
And since it was given "unsigned int" as a type, the histogram fallback
code used size and signed. The signed matched, but the size of u32 (4) did
not match zero, and the event failed to be created.

This can be worse if the field you want to match is not one of the
acceptable fields for a synthetic event. As event fields can have any type
that is supported in Linux, this can cause an issue. For example, if a
type is an enum, then there is no way to use it with any calculations.

Have the calculation field simply take on the size of what it is
calculating.
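
A sketch of the change in parse_expr() (illustrative, not the verbatim
patch):

	expr->operands[0] = operand1;
	expr->operands[1] = operand2;

	/* the operand sizes should match, so just pick one */
	expr->size = operand1->size;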

Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-12 13:21:00 +02:00
Daniel Borkmann fd568de580 bpf: Fix leakage under speculation on mispredicted branches
commit 9183671af6dbf60a1219371d4ed73e23f43b49db upstream

The verifier only enumerates valid control-flow paths and skips paths that
are unreachable in the non-speculative domain. And so it can miss issues
under speculative execution on mispredicted branches.

For example, a type confusion has been demonstrated with the following
crafted program:

  // r0 = pointer to a map array entry
  // r6 = pointer to readable stack slot
  // r9 = scalar controlled by attacker
  1: r0 = *(u64 *)(r0) // cache miss
  2: if r0 != 0x0 goto line 4
  3: r6 = r9
  4: if r0 != 0x1 goto line 6
  5: r9 = *(u8 *)(r6)
  6: // leak r9

Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier
concludes that the pointer dereference on line 5 is safe. But: if the
attacker trains both the branches to fall-through, such that the following
is speculatively executed ...

  r6 = r9
  r9 = *(u8 *)(r6)
  // leak r9

... then the program will dereference an attacker-controlled value and could
leak its content under speculative execution via side-channel. This requires
to mistrain the branch predictor, which can be rather tricky, because the
branches are mutually exclusive. However such training can be done at
congruent addresses in user space using different branches that are not
mutually exclusive. That is, by training branches in user space ...

  A:  if r0 != 0x0 goto line C
  B:  ...
  C:  if r0 != 0x0 goto line D
  D:  ...

... such that addresses A and C collide to the same CPU branch prediction
entries in the PHT (pattern history table) as those of the BPF program's
lines 2 and 4, respectively. A non-privileged attacker could simply brute
force such collisions in the PHT until observing the attack succeeding.

Alternative methods to mistrain the branch predictor are also possible that
avoid brute forcing the collisions in the PHT. A reliable attack has been
demonstrated, for example, using the following crafted program:

  // r0 = pointer to a [control] map array entry
  // r7 = *(u64 *)(r0 + 0), training/attack phase
  // r8 = *(u64 *)(r0 + 8), oob address
  // [...]
  // r0 = pointer to a [data] map array entry
  1: if r7 == 0x3 goto line 3
  2: r8 = r0
  // crafted sequence of conditional jumps to separate the conditional
  // branch in line 193 from the current execution flow
  3: if r0 != 0x0 goto line 5
  4: if r0 == 0x0 goto exit
  5: if r0 != 0x0 goto line 7
  6: if r0 == 0x0 goto exit
  [...]
  187: if r0 != 0x0 goto line 189
  188: if r0 == 0x0 goto exit
  // load any slowly-loaded value (due to cache miss in phase 3) ...
  189: r3 = *(u64 *)(r0 + 0x1200)
  // ... and turn it into known zero for verifier, while preserving slowly-
  // loaded dependency when executing:
  190: r3 &= 1
  191: r3 &= 2
  // speculatively bypassed phase dependency
  192: r7 += r3
  193: if r7 == 0x3 goto exit
  194: r4 = *(u8 *)(r8 + 0)
  // leak r4

As can be seen, in training phase (phase != 0x3), the condition in line 1
turns into false and therefore r8 with the oob address is overridden with
the valid map value address, which in line 194 we can read out without
issues. However, in attack phase, line 2 is skipped, and due to the cache
miss in line 189 where the map value is (zeroed and later) added to the
phase register, the condition in line 193 takes the fall-through path due
to prior branch predictor training, where under speculation, it'll load the
byte at oob address r8 (unknown scalar type at that point) which could then
be leaked via side-channel.

One way to mitigate these is to 'branch off' an unreachable path, meaning,
the current verification path keeps following the is_branch_taken() path
and we push the other branch to the verification stack. Given this is
unreachable from the non-speculative domain, this branch's vstate is
explicitly marked as speculative. This is needed for two reasons: i) if
this path is solely seen from speculative execution, then we later on still
want the dead code elimination to kick in in order to sanitize these
instructions with jmp-1s, and ii) to ensure that paths walked in the
non-speculative domain are not pruned from earlier walks of paths walked in
the speculative domain. Additionally, for robustness, we mark the registers
which have been part of the conditional as unknown in the speculative path
given there should be no assumptions made on their content.
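
Very roughly, the shape of that mitigation in the verifier (a sketch under
the assumption that push_stack() already takes a 'speculative' flag; not the
verbatim patch):

  static struct bpf_verifier_state *sanitize_speculative_path(struct bpf_verifier_env *env,
                                                              const struct bpf_insn *insn,
                                                              u32 next_idx, u32 curr_idx)
  {
          struct bpf_verifier_state *branch;
          struct bpf_reg_state *regs;

          /* push the branch that is dead in the non-speculative domain,
           * marked speculative so it is still walked and sanitized */
          branch = push_stack(env, next_idx, curr_idx, true);
          if (branch && insn) {
                  regs = branch->frame[branch->curframe]->regs;
                  /* no assumptions on regs that took part in the condition */
                  if (BPF_SRC(insn->code) == BPF_K) {
                          mark_reg_unknown(env, regs, insn->dst_reg);
                  } else if (BPF_SRC(insn->code) == BPF_X) {
                          mark_reg_unknown(env, regs, insn->dst_reg);
                          mark_reg_unknown(env, regs, insn->src_reg);
                  }
          }
          return branch;
  }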

The fix in here mitigates the type confusion attacks described earlier due to
i) all code paths in the BPF program being explored and ii) existing
verifier logic already ensuring that a given memory access instruction
references one specific data structure.

An alternative to this fix that has also been looked at in this scope was to
mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as
well as direction encoding (always-goto, always-fallthrough, unknown), such
that mixing of different always-* directions themselves as well as mixing of
always-* with unknown directions would cause a program rejection by the
verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else
{ x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this
would result in only single direction always-* taken paths, and unknown taken
paths being allowed, such that the former could be patched from a conditional
jump to an unconditional jump (ja). Compared to this approach here, it would
have two downsides: i) valid programs that otherwise are not performing any
pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are
required to turn off path pruning for unprivileged, where both can be avoided
in this work through pushing the invalid branch to the verification stack.

The issue was originally discovered by Adam and Ofek, and later independently
discovered and reported as a result of Benedict and Piotr's research work.

Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Adam Morrison <mad@cs.tau.ac.il>
Reported-by: Ofek Kirzner <ofekkir@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[OP: use allow_ptr_leaks instead of bypass_spec_v1]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-08 09:04:08 +02:00
Daniel Borkmann d2f790327f bpf: Do not mark insn as seen under speculative path verification
commit fe9a5ca7e370e613a9a75a13008a3845ea759d6e upstream

... in such circumstances, we do not want to mark the instruction as seen given
the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable
from the non-speculative path verification. We do however want to verify it for
safety regardless.

With the patch as-is all the insns that have been marked as seen before the
patch will also be marked as seen after the patch (just with a potentially
different non-zero count). An upcoming patch will also verify paths that are
unreachable in the non-speculative domain, hence this extension is needed.
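
In 5.4 terms (see the backport note below), the helper presumably ends up as
simple as:

  static void sanitize_mark_insn_seen(struct bpf_verifier_env *env)
  {
          struct bpf_verifier_state *cur = env->cur_state;

          /* do not mark as seen when walking a speculative-only path */
          if (!cur->speculative)
                  env->insn_aux_data[env->insn_idx].seen = true;
  }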

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[OP: - env->pass_cnt is not used in 5.4, so adjust sanitize_mark_insn_seen()
       to assign "true" instead
     - drop sanitize_insn_aux_data() comment changes, as the function is not
       present in 5.4]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-08 09:04:08 +02:00