Commit Graph

1304 Commits

Author SHA1 Message Date
Sergey Senozhatsky 28996dbb8a rcu/tree: Handle VM stoppage in stall detection
[ Upstream commit ccfc9dd6914feaa9a81f10f9cce56eb0f7712264 ]

The soft watchdog timer function checks if a virtual machine
was suspended and hence what looks like a lockup in fact
is a false positive.

This is what kvm_check_and_clear_guest_paused() does: it
tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
and if it's set then we need to touch all watchdogs and bail
out.

Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
check works fine.

There is, however, one more watchdog that runs from IRQ, so
watchdog timer fn races with it, and that watchdog is not aware
of PVCLOCK_GUEST_STOPPED - RCU stall detector.

apic_timer_interrupt()
 smp_apic_timer_interrupt()
  hrtimer_interrupt()
   __hrtimer_run_queues()
    tick_sched_timer()
     tick_sched_handle()
      update_process_times()
       rcu_sched_clock_irq()

This triggers RCU stalls on our devices during VM resume.

If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
then there is nothing on this VCPU that touches watchdogs and
RCU reads stale gp stall timestamp and new jiffies value, which
makes it think that RCU has stalled.

Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when we resume the VM.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:47:26 +02:00
Frederic Weisbecker fd005f53cb srcu: Fix broken node geometry after early ssp init
[ Upstream commit b5befe842e6612cf894cf4a199924ee872d8b7d8 ]

An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS.  Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.

Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy.  For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist.  On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.

There are at least two bad possible outcomes to this:

1) a) A callback enqueued early on an srcu_data structure (call it
      *sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
      srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
      (say 3 levels).

   b) The grace period ends after rcu_init_geometry() shrinks the
      nodes level to a single one.  srcu_gp_end() walks through the new
      srcu_node hierarchy without ever reaching the old leaves so the
      callback is never executed.

   This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
   and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
   verification never completes and the boot hangs:

	[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
	[ 5413.147564]       Not tainted 5.12.0-rc4+ #28
	[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	[ 5413.159753] task:swapper/0       state:D stack:    0 pid:    1 ppid:     0 flags:0x00004000
	[ 5413.168099] Call Trace:
	[ 5413.170555]  __schedule+0x36c/0x930
	[ 5413.174057]  ? wait_for_completion+0x88/0x110
	[ 5413.178423]  schedule+0x46/0xf0
	[ 5413.181575]  schedule_timeout+0x284/0x380
	[ 5413.185591]  ? wait_for_completion+0x88/0x110
	[ 5413.189957]  ? mark_held_locks+0x61/0x80
	[ 5413.193882]  ? mark_held_locks+0x61/0x80
	[ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
	[ 5413.202173]  ? wait_for_completion+0x88/0x110
	[ 5413.206535]  wait_for_completion+0xb4/0x110
	[ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
	[ 5413.215610]  srcu_barrier+0x187/0x200
	[ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
	[ 5413.224244]  ? rdinit_setup+0x2b/0x2b
	[ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
	[ 5413.232700]  do_one_initcall+0x63/0x310
	[ 5413.236541]  ? rdinit_setup+0x2b/0x2b
	[ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
	[ 5413.244912]  kernel_init_freeable+0x253/0x28f
	[ 5413.249273]  ? rest_init+0x250/0x250
	[ 5413.252846]  kernel_init+0xa/0x110
	[ 5413.256257]  ret_from_fork+0x22/0x30

2) An srcu_struct structure that is initialized before rcu_init_geometry()
   and used afterward will always have stale rdp->mynode references,
   resulting in callbacks to be missed in srcu_gp_end(), just like in
   the previous scenario.

This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed.  This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:10:41 +02:00
Paul E. McKenney 61f6c18fff rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
[ Upstream commit 8e4b1d2bc198e34b48fc7cc3a3c5a2fcb269e271 ]

Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(),
which works, except that rcu_spawn_gp_kthread() is also invoked via an
early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to
kthread_prio that are carried out by rcu_spawn_gp_kthread().  There is
no guaranttee of ordering among early_initcall() handlers, and thus no
guarantee that kthread_prio will be properly checked and range-limited
at the time that rcu_spawn_core_kthreads() needs it.

In most cases, this bug is harmless.  After all, the only reason that
rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user
specified a nonsensical value for this boot parameter, which experience
indicates is rare.

Nevertheless, a bug is a bug.  This commit therefore causes the
rcu_spawn_core_kthreads() function to be invoked directly from
rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have
been carried out.

Fixes: 48d07c04b4 ("rcu: Enable elimination of Tree-RCU softirq processing")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:53:36 +02:00
Frederic Weisbecker 337bba09d8 rcu/nocb: Perform deferred wake up before last idle's need_resched() check
commit 43789ef3f7d61aa7bed0cb2764e588fc990c30ef upstream.

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this with splitting the rcuog wakeup handling from rcu_idle_enter()
and place it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.

Fixes: 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:47 +01:00
Frederic Weisbecker 3cd9a74bea rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
commit 54b7429efffc99e845ba9381bee3244f012a06c2 upstream.

Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently whether initiated by idle, user or guest. Prepare
with pulling that control up to rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frederic@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 10:26:46 +01:00
Will Deacon 54d11983c2 arm64: smp: Tell RCU about CPUs that fail to come online
[ Upstream commit 04e613ded8c26489b3e0f9101b44462f780d1a35 ]

Commit ce3d31ad3cac ("arm64/smp: Move rcu_cpu_starting() earlier") ensured
that RCU is informed early about incoming CPUs that might end up calling
into printk() before they are online. However, if such a CPU fails the
early CPU feature compatibility checks in check_local_cpu_capabilities(),
then it will be powered off or parked without informing RCU, leading to
an endless stream of stalls:

  | rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  | rcu:	2-O...: (0 ticks this GP) idle=002/1/0x4000000000000000 softirq=0/0 fqs=2593
  | (detected by 0, t=5252 jiffies, g=9317, q=136)
  | Task dump for CPU 2:
  | task:swapper/2       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x00000028
  | Call trace:
  | ret_from_fork+0x0/0x30

Ensure that the dying CPU invokes rcu_report_dead() prior to being powered
off or parked.

Cc: Qian Cai <cai@redhat.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Qian Cai <cai@redhat.com>
Link: https://lore.kernel.org/r/20201105222242.GA8842@willie-the-truck
Link: https://lore.kernel.org/r/20201106103602.9849-3-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-24 13:29:01 +01:00
Neeraj Upadhyay ef0dcab6d2 rcu: Allow only one expedited GP to run concurrently with wakeups
commit 4bc6b745e5cbefed92c48071e28a5f41246d0470 upstream.

The current expedited RCU grace-period code expects that a task
requesting an expedited grace period cannot awaken until that grace
period has reached the wakeup phase.  However, it is possible for a long
preemption to result in the waiting task never sleeping.  For example,
consider the following sequence of events:

1.	Task A starts an expedited grace period by invoking
	synchronize_rcu_expedited().  It proceeds normally up to the
	wait_event() near the end of that function, and is then preempted
	(or interrupted or whatever).

2.	The expedited grace period completes, and a kworker task starts
	the awaken phase, having incremented the counter and acquired
	the rcu_state structure's .exp_wake_mutex.  This kworker task
	is then preempted or interrupted or whatever.

3.	Task A resumes and enters wait_event(), which notes that the
	expedited grace period has completed, and thus doesn't sleep.

4.	Task B starts an expedited grace period exactly as did Task A,
	complete with the preemption (or whatever delay) just before
	the call to wait_event().

5.	The expedited grace period completes, and another kworker
	task starts the awaken phase, having incremented the counter.
	However, it blocks when attempting to acquire the rcu_state
	structure's .exp_wake_mutex because step 2's kworker task has
	not yet released it.

6.	Steps 4 and 5 repeat, resulting in overflow of the rcu_node
	structure's ->exp_wq[] array.

In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
array will just be spuriously awakened, but they will just sleep again
on noting that the rcu_state structure's ->expedited_sequence value has
not advanced far enough.

In practice, this wastes CPU time and is an accident waiting to happen.
This commit therefore moves the rcu_exp_gp_seq_end() call that officially
ends the expedited grace period (along with associate tracing) until
after the ->exp_wake_mutex has been acquired.  This prevents Task A from
awakening prematurely, thus preventing more than one expedited grace
period from being in flight during a previous expedited grace period's
wakeup phase.

Fixes: 3b5f668e71 ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: Added updated comment. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05 16:43:50 +01:00
Marco Elver 255edefeb0 rcu: Fix data-race due to atomic_t copy-by-value
[ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]

This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.

This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
 atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
 rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
 dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
 force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
 rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
 rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
 rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
 kthread+0x1b5/0x200 kernel/kthread.c:255
 <snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
 rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
 rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
 irq_enter+0x5/0x50 kernel/softirq.c:347
 <snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:23 +01:00
Neeraj Upadhyay b7725deb9d rcu: Fix missed wakeup of exp_wq waiters
[ Upstream commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f ]

Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:

1.	Tasks A and B both attempt to start an expedited grace
	period at about the same time.	This grace period will have
	completed when the lower four bits of the rcu_state structure's
	->expedited_sequence field are 0b'0100', for example, when the
	initial value of this counter is zero.	Task A wins, and thus
	does the actual work of starting the grace period, including
	acquiring the rcu_state structure's .exp_mutex and sets the
	counter to 0b'0001'.

2.	Because task B lost the race to start the grace period, it
	waits on ->expedited_sequence to reach 0b'0100' inside of
	exp_funnel_lock(). This task therefore blocks on the rcu_node
	structure's ->exp_wq[1] field, keeping in mind that the
	end-of-grace-period value of ->expedited_sequence (0b'0100')
	is shifted down two bits before indexing the ->exp_wq[] field.

3.	Task C attempts to start another expedited grace period,
	but blocks on ->exp_mutex, which is still held by Task A.

4.	The aforementioned expedited grace period completes, so that
	->expedited_sequence now has the value 0b'0100'.  A kworker task
	therefore acquires the rcu_state structure's ->exp_wake_mutex
	and starts awakening any tasks waiting for this grace period.

5.	One of the first tasks awakened happens to be Task A.  Task A
	therefore releases the rcu_state structure's ->exp_mutex,
	which allows Task C to start the next expedited grace period,
	which causes the lower four bits of the rcu_state structure's
	->expedited_sequence field to become 0b'0101'.

6.	Task C's expedited grace period completes, so that the lower four
	bits of the rcu_state structure's ->expedited_sequence field now
	become 0b'1000'.

7.	The kworker task from step 4 above continues its wakeups.
	Unfortunately, the wake_up_all() refetches the rcu_state
	structure's .expedited_sequence field:

	wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

	This results in the wakeup being applied to the rcu_node
	structure's ->exp_wq[2] field, which is unfortunate given that
	Task B is instead waiting on ->exp_wq[1].

On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup.  But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period.  On such embedded systems,
this situation could therefore result in a system hang.

This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:

	schedule()
	synchronize_rcu_expedited()
	synchronize_rcu()
	scsi_device_quiesce()
	scsi_bus_suspend()
	dpm_run_callback()
	__device_suspend()

This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.

Fixes: 3b5f668e71 ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:22 +01:00
Stefan Reiter 3ece067c12 rcu/nocb: Fix dump_tree hierarchy print always active
[ Upstream commit 610dea36d3083a977e4f156206cbe1eaa2a532f0 ]

Commit 18cd8c93e6 ("rcu/nocb: Print gp/cb kthread hierarchy if
dump_tree") added print statements to rcu_organize_nocb_kthreads for
debugging, but incorrectly guarded them, causing the function to always
spew out its message.

This patch fixes it by guarding both pr_alert statements with dump_tree,
while also changing the second pr_alert to a pr_cont, to print the
hierarchy in a single line (assuming that's how it was supposed to
work).

Fixes: 18cd8c93e6 ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
Signed-off-by: Stefan Reiter <stefan@pimaker.at>
[ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:22 +01:00
Paul E. McKenney 6a4fea54ab rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
commit c51f83c315c392d9776c33eb16a2fe1349d65c7f upstream.

The rcu_node structure's ->expmask field is updated only when holding the
->lock, but is also accessed locklessly.  This means that all ->expmask
updates must use WRITE_ONCE() and all reads carried out without holding
->lock must use READ_ONCE().  This commit therefore changes the lockless
->expmask read in rcu_read_unlock_special() to use READ_ONCE().

Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11 04:35:08 -08:00
Paul E. McKenney c71706a5ff srcu: Apply *_ONCE() to ->srcu_last_gp_end
commit 844a378de3372c923909681706d62336d702531e upstream.

The ->srcu_last_gp_end field is accessed from any CPU at any time
by synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE().  This commit therefore makes that change.

Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11 04:35:08 -08:00
Eric Dumazet dcad7270b2 rcu: Avoid data-race in rcu_gp_fqs_check_wake()
commit 6935c3983b246d5fbfebd3b891c825e65c118f2d upstream.

The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might overwrite this field.

We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :

BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore

write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
 rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
 rcu_read_unlock include/linux/rcupdate.h:645 [inline]
 __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
 ip_queue_xmit+0x45/0x60 include/net/ip.h:236
 __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
 __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
 tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
 tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
 tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1864 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414

read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
 rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
 rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
 rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
 rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck:  Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11 04:35:08 -08:00
Paul E. McKenney a523031513 rcu: Use *_ONCE() to protect lockless ->expmask accesses
commit 15c7c972cd26d89a26788e609c53b5a465324a6c upstream.

The rcu_node structure's ->expmask field is accessed locklessly when
starting a new expedited grace period and when reporting an expedited
RCU CPU stall warning.  This commit therefore handles the former by
taking a snapshot of ->expmask while the lock is held and the latter
by applying READ_ONCE() to lockless reads and WRITE_ONCE() to the
corresponding updates.

Link: https://lore.kernel.org/lkml/CANpmjNNmSOagbTpffHr4=Yedckx9Rm2NuGqC9UqE+AOz5f1-ZQ@mail.gmail.com
Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11 04:35:08 -08:00
Dan Carpenter e4e32af72d rcu: Fix uninitialized variable in nocb_gp_wait()
[ Upstream commit b8889c9c89a2655a231dfed93cc9bdca0930ea67 ]

We never set this to false.  This probably doesn't affect most people's
runtime because GCC will automatically initialize it to false at certain
common optimization levels.  But that behavior is related to a bug in
GCC and obviously should not be relied on.

Fixes: 5d6742b377 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-26 10:01:05 +01:00
Linus Torvalds 7e67a85999 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16 17:25:49 -07:00
Ingo Molnar 563c4f85f9 Merge branch 'sched/rt' into sched/core, to pick up -rt changes
Pick up the first couple of patches working towards PREEMPT_RT.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-16 14:05:04 +02:00
Eric Dumazet cfcdef5e30 rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
Bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.

When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.

This eventually reaches the point where RCU enters an emergency
mode, where rcu_do_batch() do not return until whole queue is flushed.

Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.

Goal of this patch is to avoid the infamous message like following
"need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process, instead
of 10 / INFINITE choices, we use a floor of ~1 % of current entries.

If the number is above 1000, we switch to a time based limit of 3 msec
per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns

Signed-off-by: Eric Dumazet <edumazet@google.com>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney f48fe4c586 rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
When under overload conditions, __call_rcu_nocb_wake() will wake the
no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
are no ready-to-invoke callbacks, but only after a timer delay.  If the
no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
from __call_rcu_nocb_wake() is redundant.  This commit therefore makes
__call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
->nocb_bypass_timer is pending.  This requires adding a bit of ordering
of timer actions.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney 296181d78d rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
Currently, __call_rcu_nocb_wake() advances callbacks each time that it
detects excessive numbers of callbacks, though only if it succeeds in
conditionally acquiring its leaf rcu_node structure's ->lock.  Despite
the conditional acquisition of ->lock, this does increase contention.
This commit therefore avoids advancing callbacks unless there are
callbacks in ->cblist whose grace period has completed and advancing
has not yet been done during this jiffy.

Note that this decision does not take the presence of new callbacks
into account.  That is because on this code path, there will always be
at least one new callback, namely the one we just enqueued.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney 1d5a81c18d rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
Currently, nocb_cb_wait() advances callbacks on each pass through its
loop, though only if it succeeds in conditionally acquiring its leaf
rcu_node structure's ->lock.  Despite the conditional acquisition of
->lock, this does increase contention.  This commit therefore avoids
advancing callbacks unless there are callbacks in ->cblist whose grace
period has completed.

Note that nocb_cb_wait() doesn't worry about callbacks that have not
yet been assigned a grace period.  The idea is that the only reason for
nocb_cb_wait() to advance callbacks is to allow it to continue invoking
callbacks.  Time will tell whether this is the correct choice.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney 23651d9b96 rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
The rcutree_migrate_callbacks() invokes rcu_advance_cbs() on both the
offlined CPU's ->cblist and that of the surviving CPU, then merges
them.  However, after the merge, and of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number.  This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney 273f034065 rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
When callbacks are in full flow, the common case is waiting for a
grace period, and this grace period will normally take a few jiffies to
complete.  It therefore isn't all that helpful for __call_rcu_nocb_wake()
to do a synchronous wakeup in this case.  This commit therefore turns this
into a timer-based deferred wakeup of the no-CBs grace-period kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney f7a81b12d6 rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
This commit causes locking, sleeping, and callback state to be printed
for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
rcutorture to complain.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney 6aacd88d17 rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney d1b222c6be rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations.  However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking.  Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure.  Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ).  When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first.  During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations.  However,
a CPU could simply stop invoking call_rcu() at any time.  The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first).  This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass.  A ->nocb_bypass_timer is
used to provide the needed wakeups.

[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:37:32 -07:00
Paul E. McKenney eda669a6a2 rcu/nocb: Atomic ->len field in rcu_segcblist structure
Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel.  This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney faca5c2509 rcu/nocb: Unconditionally advance and wake for excessive CBs
When there are excessive numbers of callbacks, and when either the
corresponding no-CBs callback kthread is asleep or there is no more
ready-to-invoke callbacks, and when least one callback is pending,
__call_rcu_nocb_wake() will advance the callbacks, but refrain from
awakening the corresponding no-CBs grace-period kthread.  However,
because rcu_advance_cbs_nowake() is used, it is possible (if a bit
unlikely) that the needed advancement could not happen due to a grace
period not being in progress.  Plus there will always be at least one
pending callback due to one having just now been enqueued.

This commit therefore attempts to advance callbacks and awakens the
no-CBs grace-period kthread when there are excessive numbers of callbacks
posted and when the no-CBs callback kthread is not in a position to do
anything helpful.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 4fd8c5f153 rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
The sleep/wakeup of the no-CBs grace-period kthreads is synchronized
using the ->nocb_lock of the first CPU corresponding to that kthread.
This commit provides a separate ->nocb_gp_lock for this purpose, thus
reducing contention on ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 523bddd553 rcu/nocb: Reduce contention at no-CBs invocation-done time
Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node
->lock to advance callbacks when done invoking the previous batch.
It does this while holding ->nocb_lock, which means that contention on
the leaf rcu_node ->lock visits itself on the ->nocb_lock.  This commit
therefore makes this lock acquisition conditional, forgoing callback
advancement when the leaf rcu_node ->lock is not immediately available.
(In this case, the no-CBs grace-period kthread will eventually do any
needed callback advancement.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 6608c3a027 rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
check to see if it is possible to advance callbacks without potentially
needing to awaken the grace-period kthread.  Given that the no-awaken
check can be done locklessly, this commit reverses the order, so that
rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
state before conditionally acquiring that lock, thus reducing the number
of needless acquistions of the leaf rcu_node structure's ->lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 9fcb09bddd rcu/nocb: Round down for number of no-CBs grace-period kthreads
Currently, when the square root of the number of CPUs is rounded down
by int_sqrt(), this round-down is applied to the number of callback
kthreads per grace-period kthreads.  This makes almost no difference
for large systems, but results in oddities such as three no-CBs
grace-period kthreads for a five-CPU system, which is a bit excessive.
This commit therefore causes the round-down to apply to the number of
no-CBs grace-period kthreads, so that systems with from four to eight
CPUs have only two no-CBs grace period kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 81c0b3d724 rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
A given rcu_data structure's ->nocb_lock can be acquired very frequently
by the corresponding CPU and occasionally by the corresponding no-CBs
grace-period and callbacks kthreads.  In particular, these two kthreads
will have frequent gaps between ->nocb_lock acquisitions that are roughly
a grace period in duration.  This means that any excessive ->nocb_lock
contention will be due to the CPU's acquisitions, and this in turn
enables a very naive contention-avoidance strategy to be quite effective.

This commit therefore modifies rcu_nocb_lock() to first
attempt a raw_spin_trylock(), and to atomically increment a
separate ->nocb_lock_contended across a raw_spin_lock().  This new
->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when
interrupts are enabled, with a spin-wait for contending acquisitions
to complete, thus allowing the kthreads a chance to acquire the lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 7f36ef82e5 rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
Currently, the code provides an extra wakeup for the no-CBs grace-period
kthread if one of its CPUs is generating excessive numbers of callbacks.
But satisfying though it is to wake something up when things are going
south, unless the thing being awakened can actually help solve the
problem, that extra wakeup does nothing but consume additional CPU time,
which is exactly what you don't want during a call_rcu() flood.

This commit therefore avoids doing anything if the corresponding
no-CBs callback kthread is going full tilt.  Otherwise, if advancing
callbacks immediately might help and if the leaf rcu_node structure's
lock is immediately available, this commit invokes a new variant of
rcu_advance_cbs() that advances callbacks only if doing so won't require
awakening the grace-period kthread (not to be confused with any of the
no-CBs grace-period kthreads).

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney ce0a825e40 rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
It might be hard to imagine having more than two billion callbacks
queued on a single CPU's ->cblist, but someone will do it sometime.
This commit therefore makes __call_rcu_nocb_wake() handle this situation
by upgrading local variable "len" from "int" to "long".

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 383e133283 rcu/nocb: Never downgrade ->nocb_defer_wakeup in wake_nocb_gp_defer()
Currently, wake_nocb_gp_defer() simply stores whatever waketype was
passed in, which can result in a RCU_NOCB_WAKE_FORCE being downgraded
to RCU_NOCB_WAKE, which could in turn delay callback processing.
This commit therefore adds a check so that wake_nocb_gp_defer() only
updates ->nocb_defer_wakeup when the update increases the forcefulness,
thus avoiding downgrades.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney aeeacd9d84 rcu/nocb: Enable re-awakening under high callback load
The __call_rcu_nocb_wake() function and its predecessors set
->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2
for forced reawakenings.  The former can result in a too-quick reawakening
when there are many callbacks ready to invoke and the latter prevents a
second reawakening.  This commit therefore sets ->qlen_last_fqs_check
to the current number of callbacks in both cases.  While in the area,
this commit also moves both assignments under ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 0bd55c6936 rcu/nohz: Turn off tick for offloaded CPUs
Historically, no-CBs CPUs allowed the scheduler-clock tick to be
unconditionally disabled on any transition to idle or nohz_full userspace
execution (see the rcu_needs_cpu() implementations).  Unfortunately,
the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs
use ->cblist, which might make users of battery-powered devices rather
unhappy.  This commit therefore adds explicit rcu_segcblist_is_offloaded()
checks to return to the historical energy-efficient semantics.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 969974e5c5 rcu/nocb: Suppress uninitialized false-positive in nocb_gp_wait()
Some compilers complain that wait_gp_seq might be used uninitialized
in nocb_gp_wait().  This cannot actually happen because when wait_gp_seq
is uninitialized, needwait_gp must be false, which prevents wait_gp_seq
from being used.  But this analysis is apparently beyond some compilers,
so this commit adds a bogus initialization of wait_gp_seq for the sole
purpose of suppressing the false-positive warning.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 921bb5fad1 rcu/nocb: Use build-time no-CBs check in rcu_pending()
Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even
in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded.
Given that rcu_pending() is on a fastpath, it makes sense to check for
CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded().
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney c1ab99d66e rcu/nocb: Use build-time no-CBs check in rcu_core()
Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it
needs to know whether the current CPU is a no-CBs CPU.  Given that it is
not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this repeated runtime invocation wastes CPU.  This commit
therefore created a const on-stack variable to allow this check to be
done only once per rcu_core() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney ec5ef87bac rcu/nocb: Use build-time no-CBs check in rcu_do_batch()
Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time
it needs to know whether the current CPU is a no-CBs CPU.  Given that it
is not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this per-callback invocation wastes CPU.  This commit therefore
created a const on-stack variable to allow this check to be done only
once per rcu_do_batch() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 4f9c1bc727 rcu/nocb: Remove obsolete nocb_gp_head and nocb_gp_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 2a777de757 rcu/nocb: Remove obsolete nocb_cb_tail and nocb_cb_head fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney c035280f17 rcu/nocb: Remove obsolete nocb_q_count and nocb_q_count_lazy fields
This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
disable the ->cblist fields of offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney e7f4c5b399 rcu/nocb: Remove obsolete nocb_head and nocb_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 5d6742b377 rcu/nocb: Use rcu_segcblist for no-CBs CPUs
Currently the RCU callbacks for no-CBs CPUs are queued on a series of
ad-hoc linked lists, which means that these callbacks cannot benefit
from "drive-by" grace periods, thus suffering needless delays prior
to invocation.  In addition, the no-CBs grace-period kthreads first
wait for callbacks to appear and later wait for a new grace period,
which means that callbacks appearing during a grace-period wait can
be delayed.  These delays increase memory footprint, and could even
result in an out-of-memory condition.

This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
rcu_segcblist structure that is already used by non-no-CBs CPUs.  It also
restructures the no-CBs grace-period kthread to be checking for incoming
callbacks while waiting for grace periods.  Also, instead of waiting
for a new grace period, it waits for the closest grace period that will
cause some of the callbacks to be safe to invoke.  All of these changes
reduce callback latency and thus the number of outstanding callbacks,
in turn reducing the probability of an out-of-memory condition.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney e83e73f5b0 rcu/nocb: Leave ->cblist enabled for no-CBs CPUs
As a first step towards making no-CBs CPUs use the ->cblist, this commit
leaves the ->cblist enabled for these CPUs.  The main reason to make
no-CBs CPUs use ->cblist is to take advantage of callback numbering,
which will reduce the effects of missed grace periods which in turn will
reduce forward-progress problems for no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney e6060b41c9 rcu/nocb: Allow lockless use of rcu_segcblist_empty()
Currently, rcu_segcblist_empty() assumes that the callback list is not
being changed by other CPUs, but upcoming changes will require it to
operate locklessly.  This commit therefore adds the needed READ_ONCE()
call, along with the WRITE_ONCE() calls when updating the callback list's
->head field.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney 76c6927c3e rcu/nocb: Allow lockless use of rcu_segcblist_restempty()
Currently, rcu_segcblist_restempty() assumes that the callback list
is not being changed by other CPUs, but upcoming changes will require
it to operate locklessly.  This commit therefore adds the needed
READ_ONCE() calls, along with the WRITE_ONCE() calls when updating
the callback list.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00