linux-brain/drivers/md
Arne Welzel 80d1675903 dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream.

On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt devices
 * Run stress-ng --hdd 8 within one of above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353df8 ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:21 +02:00
..
bcache bcache: add proper error unwinding in bcache_device_init 2021-09-15 09:47:27 +02:00
persistent-data dm btree remove: assign new_root only when removal succeeds 2021-07-19 08:53:17 +02:00
Kconfig dm integrity: select CRYPTO_SKCIPHER 2021-01-27 11:47:42 +01:00
Makefile dm: add clone target 2019-09-12 09:32:31 -04:00
dm-bio-prison-v1.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v1.h block: switch bios to blk_status_t 2017-06-09 09:27:32 -06:00
dm-bio-prison-v2.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v2.h dm bio prison v2: new interface for the bio prison 2017-03-07 11:30:16 -05:00
dm-bio-record.h dm bio record: save/restore bi_end_io and bi_integrity 2020-03-25 08:25:48 +01:00
dm-bufio.c dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size 2021-03-09 11:09:37 +01:00
dm-builtin.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-cache-background-tracker.c dm cache background tracker: fix sparse warning 2018-04-30 15:40:40 -04:00
dm-cache-background-tracker.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-block-types.h
dm-cache-metadata.c dm cache metadata: Avoid returning cmd->bm wild pointer on error 2020-09-09 19:12:35 +02:00
dm-cache-metadata.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-policy-internal.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-policy-smq.c dm: remove unnecessary unlikely() around WARN_ON_ONCE() 2018-10-16 14:34:59 -04:00
dm-cache-policy.c
dm-cache-policy.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-target.c dm cache: fix a crash due to incorrect work item cancelling 2020-03-12 13:00:23 +01:00
dm-clone-metadata.c dm clone: Fix handling of partial region discards 2020-04-17 10:50:24 +02:00
dm-clone-metadata.h dm clone: replace spin_lock_irqsave with spin_lock_irq 2020-04-17 10:50:23 +02:00
dm-clone-target.c dm clone: Add missing casts to prevent overflows and data corruption 2020-04-17 10:50:24 +02:00
dm-core.h dm: fix deadlock when swapping to encrypted device 2021-03-04 10:26:51 +01:00
dm-crypt.c dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() 2021-09-22 12:26:21 +02:00
dm-delay.c dm delay: fix a crash when invalid device is specified 2019-04-26 11:29:32 -04:00
dm-dust.c dm dust: use dust block size for badblocklist index 2019-08-21 11:27:17 -04:00
dm-era-target.c dm era: Update in-core bitset after committing the metadata 2021-03-04 10:26:53 +01:00
dm-exception-store.c
dm-exception-store.h - Improve DM snapshot target's scalability by using finer grained 2019-05-16 15:55:48 -07:00
dm-flakey.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-init.c docs: device-mapper: move it to the admin-guide 2019-07-15 11:03:01 -03:00
dm-integrity.c dm integrity: fix missing goto in bitmap_flush_interval error handling 2021-05-11 14:04:18 +02:00
dm-io.c dm: Use kzalloc for all structs with embedded biosets/mempools 2018-06-05 08:47:43 -06:00
dm-ioctl.c dm ioctl: fix out of bounds array access when no devices 2021-03-30 14:35:24 +02:00
dm-kcopyd.c dm kcopyd: always complete failed jobs 2019-08-15 15:57:39 -04:00
dm-linear.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-log-userspace-base.c dm: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c dm log writes: fix incorrect comment about the logged sequence example 2019-07-09 14:13:33 -04:00
dm-log.c
dm-mpath.c dm mpath: fix racey management of PG initialization 2020-09-09 19:12:35 +02:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-raid.c dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences 2021-05-11 14:04:15 +02:00
dm-raid1.c dm raid1: use struct_size() with kzalloc() 2019-08-26 11:05:32 -04:00
dm-region-hash.c - Error path bug fix for overflow tests (Dan) 2018-06-12 18:28:00 -07:00
dm-round-robin.c dm round robin: revert "use percpu 'repeat_count' and 'current_path'" 2017-02-17 00:54:09 -05:00
dm-rq.c dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails 2021-05-11 14:04:18 +02:00
dm-rq.h dm: remove unused _rq_tio_cache and _rq_cache 2019-03-05 14:48:50 -05:00
dm-service-time.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-snap-persistent.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
dm-snap-transient.c
dm-snap.c dm snapshot: properly fix a crash when an origin has no snapshots 2021-06-03 08:59:02 +02:00
dm-stats.c dm stats: use struct_size() helper 2019-09-04 09:39:22 -04:00
dm-stats.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-stripe.c dax: Introduce a ->copy_to_iter dax operation 2018-05-22 23:18:31 -07:00
dm-switch.c dm switch: use struct_size() in kzalloc() 2019-03-05 14:48:51 -05:00
dm-sysfs.c dm: remove legacy request-based IO path 2018-10-11 11:36:09 -04:00
dm-table.c dm table: fix zoned iterate_devices based device capability checks 2021-03-11 14:06:49 +01:00
dm-target.c dm mpath: fix missing call of path selector type->end_io 2019-04-25 15:38:52 -04:00
dm-thin-metadata.c dm thin metadata: Fix use-after-free in dm_bm_set_read_only 2020-09-09 19:12:36 +02:00
dm-thin-metadata.h dm thin metadata: Add support for a pre-commit callback 2019-12-21 11:05:01 +01:00
dm-thin.c dm thin: don't allow changing data device during thin-pool reload 2020-02-24 08:36:49 +01:00
dm-uevent.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-uevent.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-unstripe.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2018-12-18 09:02:26 -05:00
dm-verity-fec.c dm verity fec: fix misaligned RS roots IO 2021-04-21 12:56:15 +02:00
dm-verity-fec.h dm verity fec: fix misaligned RS roots IO 2021-04-21 12:56:15 +02:00
dm-verity-target.c dm verity: fix DM_VERITY_OPTS_MAX value 2021-03-30 14:35:24 +02:00
dm-verity-verify-sig.c dm verity: fix require_signatures module_param permissions 2021-06-16 11:59:37 +02:00
dm-verity-verify-sig.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-writecache.c dm writecache: return the exact table values that were set 2021-07-25 14:35:14 +02:00
dm-zero.c dm: don't return errnos from ->map 2017-06-09 09:27:32 -06:00
dm-zoned-metadata.c dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone 2020-06-24 17:50:31 +02:00
dm-zoned-reclaim.c dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone 2020-06-24 17:50:31 +02:00
dm-zoned-target.c dm zoned: assign max_io_len correctly 2020-07-09 09:37:57 +02:00
dm-zoned.h dm zoned: reduce overhead of backing device checks 2019-12-17 19:56:12 +01:00
dm.c dm table: fix DAX iterate_devices based device capability checks 2021-03-11 14:06:49 +01:00
dm.h dm table: fix DAX iterate_devices based device capability checks 2021-03-11 14:06:49 +01:00
md-bitmap.c md/bitmap: wait for external bitmap writes to complete during tear down 2021-05-14 09:44:12 +02:00
md-bitmap.h md: Avoid namespace collision with bitmap API 2018-08-01 15:49:39 -07:00
md-cluster.c md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:51:45 +01:00
md-cluster.h md-cluster: introduce resync_info_get interface for sanity check 2018-10-18 09:36:35 -07:00
md-faulty.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47 2019-05-24 17:27:13 +02:00
md-linear.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-linear.h Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-11-14 16:07:26 -08:00
md-multipath.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-multipath.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md.c md: Fix missing unused status line of /proc/mdstat 2021-05-14 09:44:13 +02:00
md.h md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
raid0.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
raid0.h md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-13 13:10:05 -07:00
raid1-10.c md: raid1-10: Unify r{1,10}bio_pool_free 2019-06-15 01:37:35 -06:00
raid1.c md/raid10: properly indicate failure when ending a failed write request 2021-08-12 13:21:03 +02:00
raid1.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid5-cache.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5-log.h raid5: set write hint for PPL 2019-03-12 10:15:18 -07:00
raid5-ppl.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5.c md/raid5: fix oops during stripe resizing 2020-11-05 11:43:22 +01:00
raid5.h raid5: use bio_end_sector in r5_next_bio 2019-09-13 13:14:43 -07:00
raid10.c md/raid10: properly indicate failure when ending a failed write request 2021-08-12 13:21:03 +02:00
raid10.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00