linux-brain/Documentation
Dave Chiluk de53fd7aed sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.
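
The symptom described above can be checked from a cgroup's cpu.stat file,
which under the cgroup v1 cpu controller reports nr_periods, nr_throttled
and throttled_time. The following is a minimal sketch and not part of this
patch: the helper names parse_cfs_stat() and throttled_pct() are
hypothetical, and the caller is assumed to have already read the file
contents into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters reported in a cgroup v1 cpu.stat file. */
struct cfs_stat {
	unsigned long long nr_periods;     /* enforcement periods elapsed */
	unsigned long long nr_throttled;   /* periods in which the group was throttled */
	unsigned long long throttled_time; /* total time spent throttled, in ns */
};

/*
 * Hypothetical helper: parse "key value" lines from the text of cpu.stat.
 * Returns 0 if all three counters were found, -1 otherwise.
 */
static int parse_cfs_stat(const char *text, struct cfs_stat *out)
{
	unsigned int seen = 0;
	const char *p = text;

	assert(text && out);
	memset(out, 0, sizeof(*out));
	while (p && *p) {
		char key[32];
		unsigned long long val;

		if (sscanf(p, "%31s %llu", key, &val) == 2) {
			if (!strcmp(key, "nr_periods")) {
				out->nr_periods = val;
				seen |= 1u;
			} else if (!strcmp(key, "nr_throttled")) {
				out->nr_throttled = val;
				seen |= 2u;
			} else if (!strcmp(key, "throttled_time")) {
				out->throttled_time = val;
				seen |= 4u;
			}
		}
		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return seen == 7u ? 0 : -1;
}

/* Percentage of periods in which the group was throttled. */
static double throttled_pct(const struct cfs_stat *s)
{
	if (s->nr_periods == 0)
		return 0.0;
	return 100.0 * (double)s->nr_throttled / (double)s->nr_periods;
}
```

A high throttled_pct() combined with CPU usage well below the configured
quota is the pattern this commit addresses.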

This has been root caused to cpu-local run queues being allocated per-cpu
bandwidth slices and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slices results in applications not being able to utilize the quota they
have been allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74 ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, was only recently fixed, and
is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary, so threads on runqueues that do not use much
CPU can continue to use their remaining slice over a longer period of
time than cpu.cfs_period_us. This helps prevent the above condition of
hitting throttling while also not fully utilizing your cpu quota.
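
The effect of expiring versus keeping cpu-local runtime can be illustrated
with a toy model. This is a rough sketch under stated assumptions, not the
kernel's algorithm: CPUs are served one after another instead of
concurrently, each CPU wants a fixed need_us of runtime per period, runtime
is handed out in slice_us chunks, and a CPU that goes idle returns
everything except keep_us to the global pool. The function simulate() and
all of its parameters are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_MAX_CPUS 256

struct sim_result {
	uint64_t used_us;               /* runtime actually consumed */
	unsigned int throttled_periods; /* periods in which some CPU ran dry */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Toy model only; see the assumptions listed in the text above. */
static struct sim_result simulate(unsigned int ncpus, unsigned int periods,
				  uint64_t quota_us, uint64_t slice_us,
				  uint64_t keep_us, uint64_t need_us,
				  bool expire)
{
	uint64_t local[SIM_MAX_CPUS] = { 0 };
	struct sim_result res = { 0, 0 };

	assert(ncpus <= SIM_MAX_CPUS && slice_us > 0);
	for (unsigned int p = 0; p < periods; p++) {
		uint64_t pool = quota_us; /* global pool is refilled each period */
		bool throttled = false;

		for (unsigned int c = 0; c < ncpus; c++) {
			uint64_t want = need_us;

			/* Old behaviour: local runtime dies at the period boundary. */
			if (expire)
				local[c] = 0;

			for (;;) {
				/* Run on whatever local runtime is available. */
				uint64_t take = min_u64(local[c], want);

				local[c] -= take;
				want -= take;
				res.used_us += take;
				if (want == 0 || pool == 0)
					break;

				/* Ask the global pool for another slice. */
				uint64_t grant = min_u64(slice_us, pool);

				pool -= grant;
				local[c] += grant;
			}
			if (want > 0)
				throttled = true;

			/* Going idle: hand back all but keep_us to the pool. */
			if (local[c] > keep_us) {
				pool += local[c] - keep_us;
				local[c] = keep_us;
			}
		}
		if (throttled)
			res.throttled_periods++;
	}
	return res;
}
```

For example, simulate(80, 50, 10000, 5000, 1000, 100, true) models 80 CPUs
with a 10ms quota, a 5ms slice, 1ms kept locally and 100us of demand per
CPU. In this model the expiring variant consumes only 1ms of its 10ms
quota in each period and is throttled in every period, because most of the
quota is parked on a few runqueues and then discarded. Passing false for
the last argument lets the parked runtime carry over, which raises total
usage and reduces the number of throttled periods.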

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.
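
The bound described above can be written out as simple arithmetic. The
sketch below is illustrative only: the helper names are hypothetical, the
1ms figure is the min_cfs_rq_runtime value quoted in this message, and the
result is a pessimistic ceiling for a single period that assumes every CPU
carries over the full 1ms.

```c
#include <assert.h>
#include <stdint.h>

/* Value quoted in the commit message: min_cfs_rq_runtime = 1ms. */
#define MIN_CFS_RQ_RUNTIME_US 1000ULL

/* Upper bound on runtime carried across a period boundary by all CPUs. */
static uint64_t carryover_bound_us(unsigned int ncpus)
{
	assert(ncpus > 0);
	return (uint64_t)ncpus * MIN_CFS_RQ_RUNTIME_US;
}

/* Pessimistic ceiling on usage within one period: quota plus carryover. */
static uint64_t max_period_usage_us(uint64_t quota_us, unsigned int ncpus)
{
	return quota_us + carryover_bound_us(ncpus);
}
```

Any runtime carried over was itself drawn from an earlier period's quota,
which is why limits remain accurate over longer timeframes even though a
single period may exceed its quota.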

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
an 80 CPU machine), this commit resulted in an almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
2019-08-08 09:09:30 +02:00
ABI platform/x86: asus: Rename "fan mode" to "fan boost mode" 2019-07-17 19:07:58 +03:00
accounting docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
acpi/dsd docs: fix broken documentation links 2019-06-08 13:42:13 -06:00
admin-guide xen: fixes and features for 5.3-rc1 2019-07-19 11:41:26 -07:00
arm docs: arm: fix a breakage with pdf output 2019-07-15 11:03:04 -03:00
arm64 docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
auxdisplay docs: admin-guide: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
block docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-07-11 10:55:49 -07:00
cdrom docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
core-api docs: admin-guide: move sysctl directory to it 2019-07-15 11:03:01 -03:00
cpu-freq docs: power: convert docs to ReST and rename to *.rst 2019-06-14 16:08:36 -05:00
crypto crypto: doc - Fix formatting of new crypto engine content 2019-07-03 22:13:12 +08:00
dev-tools Remove references to dead website. 2019-07-19 12:22:04 -07:00
devicetree dt-bindings: pinctrl: stm32: Fix missing 'clocks' property in examples 2019-07-20 20:28:53 -06:00
doc-guide Doc : doc-guide : Fix a typo 2019-06-28 09:04:14 -06:00
driver-api New feature to add support for NTB virtual MSI interrupts, the ability 2019-07-21 09:46:59 -07:00
EDID docs: driver-api: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
fault-injection docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
fb docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
features Documentation/stackprotector: powerpc supports stack protector 2019-06-14 14:44:43 -06:00
filesystems Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-21 10:09:43 -07:00
firmware_class
firmware-guide docs: gpio: add sysfs interface to the admin-guide 2019-07-15 11:03:03 -03:00
fpga docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
gpu drm main pull request for v5.3-rc1 (sans mm changes) 2019-07-15 19:04:27 -07:00
hid docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
hwmon docs: driver-model: move it to the driver-api book 2019-07-15 11:03:02 -03:00
i2c Merge branch 'i2c/for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2019-07-15 21:10:39 -07:00
ia64 docs: add SPDX tags to new index files 2019-07-15 11:03:03 -03:00
ide docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
iio docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
infiniband docs: infiniband: add it to the driver-api bookset 2019-07-08 14:22:56 -03:00
input docs: hid: convert to ReST 2019-07-02 10:19:34 +02:00
ioctl docs: ioctl: add it to the uAPI guide 2019-07-15 09:20:28 -03:00
isdn isdn: remove isdn4linux 2019-05-31 11:13:10 +02:00
kbuild Kbuild updates for v5.3 (2nd) 2019-07-20 09:34:55 -07:00
kernel-hacking docs: locking: convert docs to ReST and rename to *.rst 2019-07-15 08:53:27 -03:00
leds docs: leds: add it to the driver-api book 2019-07-15 09:20:28 -03:00
livepatch docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
locking docs: locking: add it to the main index 2019-07-15 11:03:03 -03:00
m68k docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
maintainer docs: Add a document on repository management 2019-06-18 09:33:16 -06:00
media media: doc-rst: Fix typos 2019-06-27 07:35:47 -04:00
mic docs: driver-api: add remaining converted dirs to it 2019-07-15 11:03:03 -03:00
mips
misc-devices docs: misc-devices: convert files without extension to ReST 2019-07-03 21:09:41 +02:00
netlabel docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
networking Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-07-19 10:06:06 -07:00
nios2
openrisc
parisc
PCI Documentation: PCI: convert endpoint/pci-test-howto.txt to reST 2019-05-30 17:54:34 -05:00
pcmcia docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
power More power management updates for 5.3-rc1 2019-07-18 09:32:28 -07:00
powerpc docs: admin-guide: add kdump documentation into it 2019-07-15 11:03:01 -03:00
process docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
RCU It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
riscv RISC-V updates for v5.3 2019-07-18 12:26:59 -07:00
s390 docs: don't use nested tables 2019-07-15 11:03:04 -03:00
scheduler sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-08-08 09:09:30 +02:00
scsi scsi: ufs: Documentation: Announce ufs-tool v1.0 2019-06-26 22:47:51 -04:00
security docs: security: move some books to it and update 2019-07-15 11:03:01 -03:00
sh
sound
sparc docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
sphinx docs: automarkup.py: ignore exceptions when seeking for xrefs 2019-07-08 14:35:47 -06:00
sphinx-static
spi
target docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
thermal docs: thermal: convert to ReST 2019-06-27 21:22:15 +08:00
timers docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
trace The main changes in this release include: 2019-07-18 11:51:00 -07:00
translations Remove references to dead website. 2019-07-19 12:22:04 -07:00
usb docs: usb: rename files to .rst and add them to drivers-api 2019-06-20 14:28:36 +02:00
userspace-api docs: ocxl.rst: add it to the uAPI book 2019-07-15 11:03:02 -03:00
virtual KVM: x86: Add fixed counters to PMU filter 2019-07-20 09:00:48 +02:00
vm mm: document ZONE_DEVICE memory-model implications 2019-07-18 17:08:07 -07:00
w1 docs: driver-api: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
watchdog linux-watchdog 5.3-rc1 tag 2019-07-18 10:47:59 -07:00
wimax
x86 docs: admin-guide: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
xtensa docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
.gitignore
atomic_bitops.txt
atomic_t.txt Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-08 16:12:03 -07:00
bus-virt-phys-mapping.txt
Changes
CodingStyle
conf.py Disable Sphinx SmartyPants in HTML output 2019-06-30 15:30:34 -06:00
COPYING-logo docs: logo.txt: rename it to COPYING-logo 2019-07-15 09:20:27 -03:00
crc32.txt
debugging-modules.txt
debugging-via-ohci1394.txt
digsig.txt
DMA-API-HOWTO.txt docs: DMA-API-HOWTO.txt: fix an unmarked code block 2019-07-15 09:20:24 -03:00
DMA-API.txt Documentation: DMA-API: fix a function name of max_mapping_size 2019-06-07 11:10:33 -06:00
DMA-attributes.txt
DMA-ISA-LPC.txt
docutils.conf doc-rst: Add missing newline at end of file 2019-06-20 14:16:56 -06:00
dontdiff kbuild: create *.mod with full directory path and remove MODVERDIR 2019-07-18 02:19:31 +09:00
futex-requeue-pi.txt
hwspinlock.txt hwspinlock: add the 'in_atomic' API 2019-06-29 21:08:14 -07:00
index.rst docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
io_ordering.txt
io-mapping.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
Kconfig docs: Kbuild/Makefile: allow check for missing docs at build time 2019-06-07 11:33:16 -06:00
kobject.txt
kprobes.txt
kref.txt
logo.gif
lzo.txt
mailbox.txt
Makefile docs: Kbuild/Makefile: allow check for missing docs at build time 2019-06-07 11:33:16 -06:00
memory-barriers.txt It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
nommu-mmap.txt
packing.txt
padata.txt
percpu-rw-semaphore.txt
pi-futex.txt docs: locking: convert docs to ReST and rename to *.rst 2019-07-15 08:53:27 -03:00
preempt-locking.txt
rbtree.txt docs: rbtree.txt: fix Sphinx build warnings 2019-07-15 09:20:24 -03:00
remoteproc.txt remoteproc: add vendor resources handling 2019-06-29 12:02:17 -07:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
speculation.txt
static-keys.txt
SubmittingPatches
tee.txt Documentation: tee: Grammar s/the its/its/ 2019-06-07 11:23:38 -06:00
this_cpu_ops.txt
unaligned-memory-access.txt
xz.txt