Commit Graph

1024 Commits

Author SHA1 Message Date
Anand Jain
fb4c7d2923 btrfs: fix lockdep warning while mounting sprout fs
[ Upstream commit c124706900c20dee70f921bb3a90492431561a0a ]

Following test case reproduces lockdep warning.

  Test case:

  $ mkfs.btrfs -f <dev1>
  $ btrfstune -S 1 <dev1>
  $ mount <dev1> <mnt>
  $ btrfs device add <dev2> <mnt> -f
  $ umount <mnt>
  $ mount <dev2> <mnt>
  $ umount <mnt>

The warning claims a possible ABBA deadlock between the threads
initiated by [#1] btrfs device add and [#0] the mount.

  [ 540.743122] WARNING: possible circular locking dependency detected
  [ 540.743129] 5.11.0-rc7+ #5 Not tainted
  [ 540.743135] ------------------------------------------------------
  [ 540.743142] mount/2515 is trying to acquire lock:
  [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
  [ 540.743458] but task is already holding lock:
  [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.743541] which lock already depends on the new lock.
  [ 540.743543] the existing dependency chain (in reverse order) is:

  [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
  [ 540.743566] down_read_nested+0x48/0x2b0
  [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
  [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
  [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
  [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
  [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
  [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
  [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
  [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
  [ 540.744166] __x64_sys_ioctl+0xe4/0x140
  [ 540.744170] do_syscall_64+0x4b/0x80
  [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
  [ 540.744184] __lock_acquire+0x155f/0x2360
  [ 540.744188] lock_acquire+0x10b/0x5c0
  [ 540.744190] __mutex_lock+0xb1/0xf80
  [ 540.744193] mutex_lock_nested+0x27/0x30
  [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
  [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
  [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
  [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
  [ 540.744472] legacy_get_tree+0x38/0x90
  [ 540.744475] vfs_get_tree+0x30/0x140
  [ 540.744478] fc_mount+0x16/0x60
  [ 540.744482] vfs_kern_mount+0x91/0x100
  [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
  [ 540.744536] legacy_get_tree+0x38/0x90
  [ 540.744537] vfs_get_tree+0x30/0x140
  [ 540.744539] path_mount+0x8d8/0x1070
  [ 540.744541] do_mount+0x8d/0xc0
  [ 540.744543] __x64_sys_mount+0x125/0x160
  [ 540.744545] do_syscall_64+0x4b/0x80
  [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [ 540.744551] other info that might help us debug this:
  [ 540.744552] Possible unsafe locking scenario:

  [ 540.744553] CPU0 				CPU1
  [ 540.744554] ---- 				----
  [ 540.744555] lock(btrfs-chunk-00);
  [ 540.744557] 					lock(&fs_devs->device_list_mutex);
  [ 540.744560] 					lock(btrfs-chunk-00);
  [ 540.744562] lock(&fs_devs->device_list_mutex);
  [ 540.744564]
   *** DEADLOCK ***

  [ 540.744565] 3 locks held by mount/2515:
  [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
  [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
  [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.744708]
   stack backtrace:
  [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5

But the device_list_mutex in clone_fs_devices() is redundant, as
explained below.  Two threads [1]  and [2] (below) could lead to
clone_fs_device().

  [1]
  open_ctree <== mount sprout fs
   btrfs_read_chunk_tree()
    mutex_lock(&uuid_mutex) <== global lock
    read_one_dev()
     open_seed_devices()
      clone_fs_devices() <== seed fs_devices
       mutex_lock(&orig->device_list_mutex) <== seed fs_devices

  [2]
  btrfs_init_new_device() <== sprouting
   mutex_lock(&uuid_mutex); <== global lock
   btrfs_prepare_sprout()
     lockdep_assert_held(&uuid_mutex)
     clone_fs_devices(seed_fs_device) <== seed fs_devices

Both of these threads hold uuid_mutex which is sufficient to protect
getting the seed device(s) freed while we are trying to clone it for
sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
can not free/write/replace because it is read-only. An unmounted seed
device can be freed by btrfs_free_stale_devices(), but it needs
uuid_mutex.  So this patch removes the unnecessary device_list_mutex in
clone_fs_devices().  And adds a lockdep_assert_held(&uuid_mutex) in
clone_fs_devices().

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-26 14:07:12 +02:00
Desmond Cheong Zhi Xi
cd7b39e7c4 btrfs: reset replace target device to allocation state on close
commit 0d977e0eba234e01a60bdde27314dc21374201b3 upstream.

This crash was observed with a failed assertion on device close:

  BTRFS: Transaction aborted (error -28)
  WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
  Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
  CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
  RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
  RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
  R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
  FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
  Call Trace:
   flush_space+0x197/0x2f0 [btrfs]
   btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
   process_one_work+0x262/0x5e0
   worker_thread+0x4c/0x320
   ? process_one_work+0x5e0/0x5e0
   kthread+0x144/0x170
   ? set_kthread_struct+0x40/0x40
   ret_from_fork+0x1f/0x30
  irq event stamp: 19334989
  hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
  hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
  softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
  softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
  ---[ end trace 45939e308e0dd3c7 ]---
  BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
  BTRFS info (device vdd): forced readonly
  BTRFS warning (device vdd): failed setting block group ro: -30
  BTRFS info (device vdd): suspending dev_replace for unmount
  assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3431!
  invalid opcode: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
  RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
  RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
  RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
  R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
  FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
  Call Trace:
   btrfs_close_one_device.cold+0x11/0x55 [btrfs]
   close_fs_devices+0x44/0xb0 [btrfs]
   btrfs_close_devices+0x48/0x160 [btrfs]
   generic_shutdown_super+0x69/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x2c/0xa0
   cleanup_mnt+0x144/0x1b0
   task_work_run+0x59/0xa0
   exit_to_user_mode_loop+0xe7/0xf0
   exit_to_user_mode_prepare+0xaf/0xf0
   syscall_exit_to_user_mode+0x19/0x50
   do_syscall_64+0x4a/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:

  close_ctree():
    btrfs_dev_replace_suspend_for_unmount();
    btrfs_close_devices():
      btrfs_close_fs_devices():
        btrfs_close_one_device():
          ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                 &device->dev_state));

However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.

To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a616676839 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22 12:26:19 +02:00
Qu Wenruo
d7f7eca72e btrfs: fix NULL pointer dereference when deleting device by invalid id
commit e4571b8c5e9ffa1e85c0c671995bd4dcc5c75091 upstream.

[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:

 # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
				     /dev/test/scratch2
 # mount /dev/test/scratch1 /mnt/btrfs
 # btrfs device remove 3 /mnt/btrfs

Then we have the following kernel NULL pointer dereference:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
  btrfs_ioctl+0x18bb/0x3190 [btrfs]
  ? lock_is_held_type+0xa5/0x120
  ? find_held_lock.constprop.0+0x2b/0x80
  ? do_user_addr_fault+0x201/0x6a0
  ? lock_release+0xd2/0x2d0
  ? __x64_sys_ioctl+0x83/0xb0
  __x64_sys_ioctl+0x83/0xb0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

[CAUSE]
Commit a27a94c2b0 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().

But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.

In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.

Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.

[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.

Fixes: a27a94c2b0 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03 10:08:16 +02:00
Desmond Cheong Zhi Xi
57429c1ec7 btrfs: fix rw device counting in __btrfs_free_extra_devids
commit b2a616676839e2a6b02c8e40be7f886f882ed194 upstream.

When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.

This error was caught by Syzbot which reported a warning in
close_fs_devices:

  WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
  Modules linked in:
  CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
  RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
  RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
  RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
  R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
  R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
  FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
   open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
   btrfs_fill_super fs/btrfs/super.c:1382 [inline]
   btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x86/0x270 fs/super.c:1498
   fc_mount fs/namespace.c:993 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
   btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x86/0x270 fs/super.c:1498
   do_new_mount fs/namespace.c:2905 [inline]
   path_mount+0x196f/0x2be0 fs/namespace.c:3235
   do_mount fs/namespace.c:3248 [inline]
   __do_sys_mount fs/namespace.c:3456 [inline]
   __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
   do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:

  btrfs_mount_root():
    btrfs_scan_one_device():
      device_list_add();   <---------------- device added
    btrfs_open_devices():
      open_fs_devices():
        btrfs_open_one_device();   <-------- writable device opened,
	                                     rw device count ++
    btrfs_fill_super():
      open_ctree():
        btrfs_free_extra_devids():
	  __btrfs_free_extra_devids();  <--- writable device removed,
	                              rw device count not decremented
	  fail_tree_roots:
	    btrfs_close_devices():
	      close_fs_devices();   <------- rw device count off by 1

As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.

In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.

Additionally, following commit 82372bc816 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.

Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04 12:27:37 +02:00
Josef Bacik
e1065331b7 btrfs: fix lockdep splat in btrfs_recover_relocation
commit fb286100974e7239af243bc2255a52f29442f9c8 upstream.

While testing the error paths of relocation I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc6+ #217 Not tainted
  ------------------------------------------------------
  mount/779 is trying to acquire lock:
  ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340

  but task is already holding lock:
  ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-root-00){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x462/0x8f0
	 btrfs_update_root+0x55/0x2b0
	 btrfs_drop_snapshot+0x398/0x750
	 clean_dirty_subvols+0xdf/0x120
	 btrfs_recover_relocation+0x534/0x5a0
	 btrfs_start_pre_rw_mount+0xcb/0x170
	 open_ctree+0x151f/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (sb_internal#2){.+.+}-{0:0}:
	 start_transaction+0x444/0x700
	 insert_balance_item.isra.0+0x37/0x320
	 btrfs_balance+0x354/0xf40
	 btrfs_ioctl_balance+0x2cf/0x380
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1120/0x1e10
	 lock_acquire+0x116/0x370
	 __mutex_lock+0x7e/0x7b0
	 btrfs_recover_balance+0x2f0/0x340
	 open_ctree+0x1095/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(sb_internal#2);
				 lock(btrfs-root-00);
    lock(&fs_info->balance_mutex);

   *** DEADLOCK ***

  2 locks held by mount/779:
   #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  stack backtrace:
  CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1120/0x1e10
   lock_acquire+0x116/0x370
   ? btrfs_recover_balance+0x2f0/0x340
   __mutex_lock+0x7e/0x7b0
   ? btrfs_recover_balance+0x2f0/0x340
   ? btrfs_recover_balance+0x2f0/0x340
   ? rcu_read_lock_sched_held+0x3f/0x80
   ? kmem_cache_alloc_trace+0x2c4/0x2f0
   ? btrfs_get_64+0x5e/0x100
   btrfs_recover_balance+0x2f0/0x340
   open_ctree+0x1095/0x1726
   btrfs_mount_root.cold+0x12/0xea
   ? rcu_read_lock_sched_held+0x3f/0x80
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   ? __kmalloc_track_caller+0x2f2/0x320
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   ? capable+0x3a/0x60
   path_mount+0x433/0xc10
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is straightforward to fix, simply release the path before we setup
the balance_ctl.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:47:40 +01:00
Qu Wenruo
e21d630a2c btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
commit c57dd1f2f6a7cd1bb61802344f59ccdc5278c983 upstream

[BUG]
The following script can lead to tons of beyond device boundary access:

  mkfs.btrfs -f $dev -b 10G
  mount $dev $mnt
  trimfs $mnt
  btrfs filesystem resize 1:-1G $mnt
  trimfs $mnt

[CAUSE]
Since commit 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.

So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.

But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.

This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.

Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.

[FIX]
This patch will fix the problem in two ways:

- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
  This is the root fix

- Add extra safety check when trimming free device extents
  We check and warn if the returned range is already beyond current
  device.

Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[sudip: adjust context and use extent_io.h]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:37 +01:00
Johannes Thumshirn
6ea14731ac btrfs: don't access possibly stale fs_info data for printing duplicate device
commit 0697d9a610998b8bdee6b2390836cb2391d8fd1a upstream.

Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().

At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.

  ==================================================================
  BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
  Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

  CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1d6/0x29e lib/dump_stack.c:118
   print_address_description+0x66/0x620 mm/kasan/report.c:383
   __kasan_report mm/kasan/report.c:513 [inline]
   kasan_report+0x132/0x1d0 mm/kasan/report.c:530
   btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
   device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
   btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
   btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x44840a
  RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
  RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
  RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
  R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

  Allocated by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track mm/kasan/common.c:56 [inline]
   __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
   kmalloc_node include/linux/slab.h:577 [inline]
   kvmalloc_node+0x81/0x110 mm/util.c:574
   kvmalloc include/linux/mm.h:757 [inline]
   kvzalloc include/linux/mm.h:765 [inline]
   btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
   kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
   __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
   __cache_free mm/slab.c:3418 [inline]
   kfree+0x113/0x200 mm/slab.c:3756
   deactivate_locked_super+0xa7/0xf0 fs/super.c:335
   btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff8880878e0000
   which belongs to the cache kmalloc-16k of size 16384
  The buggy address is located 1704 bytes inside of
   16384-byte region [ffff8880878e0000, ffff8880878e4000)
  The buggy address belongs to the page:
  page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
  head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
  flags: 0xfffe0000010200(slab|head)
  raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
  raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
				    ^
   ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.

Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.

There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02 08:49:45 +01:00
Anand Jain
2033dd8852 btrfs: dev-replace: fail mount if we don't have replace item with target device
commit cf89af146b7e62af55470cf5f3ec3c56ec144a5e upstream.

If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image.  Fail the mount as this needs a closer
look what is actually wrong.

As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:

  WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x198/0x1fd lib/dump_stack.c:118
   panic+0x347/0x7c0 kernel/panic.c:231
   __warn.cold+0x20/0x46 kernel/panic.c:600
   report_bug+0x1bd/0x210 lib/bug.c:198
   handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
   exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
   asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
  RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
  RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
  RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
  RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
  RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
  R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
   close_fs_devices fs/btrfs/volumes.c:1193 [inline]
   btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
   open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
   btrfs_fill_super fs/btrfs/super.c:1316 [inline]
   btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).

CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18 19:20:29 +01:00
Josef Bacik
0ee771e969 btrfs: sysfs: init devices outside of the chunk_mutex
[ Upstream commit ca10845a56856fff4de3804c85e6424d0f6d0cde ]

While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc3+ #4 Not tainted
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc+0x37/0x270
	 alloc_inode+0x82/0xb0
	 iget_locked+0x10d/0x2c0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x240
	 sysfs_get_tree+0x16/0x40
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (kernfs_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_link+0x63/0xa0
	 sysfs_do_create_link_sd+0x5e/0xd0
	 btrfs_sysfs_add_devices_dir+0x81/0x130
	 btrfs_init_new_device+0x67f/0x1250
	 btrfs_ioctl+0x1ef/0x2e20
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_insert_empty_items+0x64/0xb0
	 btrfs_insert_delayed_items+0x90/0x4f0
	 btrfs_commit_inode_delayed_items+0x93/0x140
	 btrfs_log_inode+0x5de/0x2020
	 btrfs_log_inode_parent+0x429/0xc90
	 btrfs_log_new_name+0x95/0x9b
	 btrfs_rename2+0xbb9/0x1800
	 vfs_rename+0x64f/0x9f0
	 do_renameat2+0x320/0x4e0
	 __x64_sys_rename+0x1f/0x30
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x119c/0x1fc0
	 lock_acquire+0xa7/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x119c/0x1fc0
   lock_acquire+0xa7/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa7/0x3d0
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This happens because we are holding the chunk_mutex at the time of
adding in a new device.  However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices.  Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.

CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18 19:20:16 +01:00
Filipe Manana
c964d386e8 btrfs: fix readahead hang and use-after-free after removing a device
commit 66d204a16c94f24ad08290a7663ab67e7fc04e82 upstream.

Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

  [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
  [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
  [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
  [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
  [162513.514751] Call Trace:
  [162513.514761]  __schedule+0x5ce/0xd00
  [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.514771]  schedule+0x46/0xf0
  [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.514850]  ? finish_wait+0x90/0x90
  [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
  [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
  [162513.514894]  kthread+0x153/0x170
  [162513.514897]  ? kthread_stop+0x2c0/0x2c0
  [162513.514902]  ret_from_fork+0x22/0x30
  [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
  [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
  [162513.515682] Call Trace:
  [162513.515688]  __schedule+0x5ce/0xd00
  [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.515697]  schedule+0x46/0xf0
  [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.515716]  ? finish_wait+0x90/0x90
  [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.515761]  iterate_supers+0x87/0xf0
  [162513.515765]  ksys_sync+0x60/0xb0
  [162513.515768]  __do_sys_sync+0xa/0x10
  [162513.515771]  do_syscall_64+0x33/0x80
  [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.515781] RIP: 0033:0x7f5238f50bd7
  [162513.515782] Code: Bad RIP value.
  [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
  [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
  [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
  [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
  [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
  [162513.516620] Call Trace:
  [162513.516625]  __schedule+0x5ce/0xd00
  [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.516634]  schedule+0x46/0xf0
  [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.516650]  ? finish_wait+0x90/0x90
  [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
  [162513.516686]  __vfs_setxattr+0x66/0x80
  [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
  [162513.516697]  vfs_setxattr+0x6b/0x120
  [162513.516703]  setxattr+0x125/0x240
  [162513.516709]  ? lock_acquire+0xb1/0x480
  [162513.516712]  ? mnt_want_write+0x20/0x50
  [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
  [162513.516723]  ? preempt_count_add+0x49/0xa0
  [162513.516725]  ? __sb_start_write+0x19b/0x290
  [162513.516727]  ? preempt_count_add+0x49/0xa0
  [162513.516732]  path_setxattr+0xba/0xd0
  [162513.516739]  __x64_sys_setxattr+0x27/0x30
  [162513.516741]  do_syscall_64+0x33/0x80
  [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.516745] RIP: 0033:0x7f5238f56d5a
  [162513.516746] Code: Bad RIP value.
  [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
  [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
  [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
  [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
  [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
  [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
  [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
  [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
  [162513.517780] Call Trace:
  [162513.517786]  __schedule+0x5ce/0xd00
  [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.517796]  schedule+0x46/0xf0
  [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.517814]  ? finish_wait+0x90/0x90
  [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.517865]  iterate_supers+0x87/0xf0
  [162513.517869]  ksys_sync+0x60/0xb0
  [162513.517872]  __do_sys_sync+0xa/0x10
  [162513.517875]  do_syscall_64+0x33/0x80
  [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.517881] RIP: 0033:0x7f5238f50bd7
  [162513.517883] Code: Bad RIP value.
  [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
  [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
  [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
  [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
  [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
  [162513.519160] Call Trace:
  [162513.519165]  __schedule+0x5ce/0xd00
  [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.519174]  schedule+0x46/0xf0
  [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.519193]  ? finish_wait+0x90/0x90
  [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
  [162513.519230]  lookup_open+0x522/0x650
  [162513.519246]  path_openat+0x2b8/0xa50
  [162513.519270]  do_filp_open+0x91/0x100
  [162513.519275]  ? find_held_lock+0x32/0x90
  [162513.519280]  ? lock_acquired+0x33b/0x470
  [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
  [162513.519287]  ? _raw_spin_unlock+0x29/0x40
  [162513.519295]  do_sys_openat2+0x20d/0x2d0
  [162513.519300]  do_sys_open+0x44/0x80
  [162513.519304]  do_syscall_64+0x33/0x80
  [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.519309] RIP: 0033:0x7f5238f4a903
  [162513.519310] Code: Bad RIP value.
  [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
  [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
  [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
  [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
  [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
  [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
  [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
  [162513.520511] Call Trace:
  [162513.520516]  __schedule+0x5ce/0xd00
  [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.520525]  schedule+0x46/0xf0
  [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
  [162513.520548]  ? finish_wait+0x90/0x90
  [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
  [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
  [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
  [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
  [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
  [162513.520643]  ? do_sigaction+0xf3/0x240
  [162513.520645]  ? find_held_lock+0x32/0x90
  [162513.520648]  ? do_sigaction+0xf3/0x240
  [162513.520651]  ? lock_acquired+0x33b/0x470
  [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
  [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
  [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
  [162513.520662]  ? do_sigaction+0xf3/0x240
  [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
  [162513.520672]  __x64_sys_ioctl+0x83/0xb0
  [162513.520677]  do_syscall_64+0x33/0x80
  [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.520681] RIP: 0033:0x7fc3cd307d87
  [162513.520682] Code: Bad RIP value.
  [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
  [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
  [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
  [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
  [162513.520703]
		  Showing all locks held in the system:
  [162513.520712] 1 lock held by khungtaskd/54:
  [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
  [162513.520728] 1 lock held by in:imklog/596:
  [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
  [162513.520782] 1 lock held by btrfs-transacti/1356167:
  [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
  [162513.520798] 1 lock held by btrfs/1356190:
  [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
  [162513.520805] 1 lock held by fsstress/1356184:
  [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520811] 3 locks held by fsstress/1356185:
  [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
  [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520833] 1 lock held by fsstress/1356196:
  [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520838] 3 locks held by fsstress/1356197:
  [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
  [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520858] 2 locks held by btrfs/1356211:
  [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
  [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

  >>> t = find_task(prog, 1356190)
  >>> prog.stack_trace(t)
  #0  __schedule+0x5ce/0xcfc
  #1  schedule+0x46/0xe4
  #2  schedule_timeout+0x1df/0x475
  #3  btrfs_reada_wait+0xda/0x132
  #4  scrub_stripe+0x2a8/0x112f
  #5  scrub_chunk+0xcd/0x134
  #6  scrub_enumerate_chunks+0x29e/0x5ee
  #7  btrfs_scrub_dev+0x2d5/0x91b
  #8  btrfs_ioctl+0x7f5/0x36e7
  #9  __x64_sys_ioctl+0x83/0xb0
  #10 do_syscall_64+0x33/0x77
  #11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
        if (!atomic_read(&fs_info->reada_works_cnt))
            reada_start_machine(fs_info);
        wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                          (HZ + 9) / 10);
    }
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

  $ cat dump_reada.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  def dump_re(re):
      nzones = re.nzones.value_()
      print(f're at {hex(re.value_())}')
      print(f'\t logical {re.logical.value_()}')
      print(f'\t refcnt {re.refcnt.value_()}')
      print(f'\t nzones {nzones}')
      for i in range(nzones):
          dev = re.zones[i].device
          name = dev.name.str.string_()
          print(f'\t\t dev id {dev.devid.value_()} name {name}')
      print()

  for _, e in radix_tree_for_each(fs_info.reada_tree):
      re = cast('struct reada_extent *', e)
      dump_re(re)

  $ drgn dump_reada.py
  re at 0xffff8f3da9d25ad8
          logical 38928384
          refcnt 1
          nzones 1
                 dev id 0 name b'/dev/sdd'
  $

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

  $ cat dump_block_groups.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  BTRFS_BLOCK_GROUP_DATA = (1 << 0)
  BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
  BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
  BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
  BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
  BTRFS_BLOCK_GROUP_DUP = (1 << 5)
  BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
  BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
  BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
  BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
  BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

  def bg_flags_string(bg):
      flags = bg.flags.value_()
      ret = ''
      if flags & BTRFS_BLOCK_GROUP_DATA:
          ret = 'data'
      if flags & BTRFS_BLOCK_GROUP_METADATA:
          if len(ret) > 0:
              ret += '|'
          ret += 'meta'
      if flags & BTRFS_BLOCK_GROUP_SYSTEM:
          if len(ret) > 0:
              ret += '|'
          ret += 'system'
      if flags & BTRFS_BLOCK_GROUP_RAID0:
          ret += ' raid0'
      elif flags & BTRFS_BLOCK_GROUP_RAID1:
          ret += ' raid1'
      elif flags & BTRFS_BLOCK_GROUP_DUP:
          ret += ' dup'
      elif flags & BTRFS_BLOCK_GROUP_RAID10:
          ret += ' raid10'
      elif flags & BTRFS_BLOCK_GROUP_RAID5:
          ret += ' raid5'
      elif flags & BTRFS_BLOCK_GROUP_RAID6:
          ret += ' raid6'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
          ret += ' raid1c3'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
          ret += ' raid1c4'
      else:
          ret += ' single'

      return ret

  def dump_bg(bg):
      print()
      print(f'block group at {hex(bg.value_())}')
      print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
      print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

  bg_root = fs_info.block_group_cache_tree.address_of_()
  for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
      dump_bg(bg)

  $ drgn dump_block_groups.py

  block group at 0xffff8f3d673b0400
         start 22020096 length 16777216
         flags 258 - system raid6

  block group at 0xffff8f3d53ddb400
         start 38797312 length 536870912
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4d9c00
         start 575668224 length 2147483648
         flags 257 - data raid6

  block group at 0xffff8f3d08189000
         start 2723151872 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3db70ff000
         start 2790260736 length 1073741824
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4dd800
         start 3864002560 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3d67037000
         start 3931111424 length 2147483648
         flags 257 - data raid6
  $

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
   device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
   starting to scrub a stripe from block group X. This call to
   btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
   calls reada_add_block(), it passes the logical address of the extent
   tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
   It finds there isn't any existing readahead extent for the logical
   address 38928384, so it proceeds to the path of creating a new one.

   It calls btrfs_map_block() to find out which stripes exist for the block
   group X. On the first iteration of the for loop that iterates over the
   stripes, it finds the stripe for device /dev/sdd, so it creates one
   zone for that device and adds it to the readahead extent. Before getting
   into the second iteration of the loop, the cleanup kthread deletes block
   group X because it was empty. So in the iterations for the remaining
   stripes it does not add more zones to the readahead extent, because the
   calls to reada_find_zone() returned NULL because they couldn't find
   block group X anymore.

   As a result the new readahead extent has a single zone, corresponding to
   the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
   for the readahead work queue, task B finishes the device replace and at
   btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
   device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
   "->elems" of the reada_control structure allocated at btrfs_reada_add().

   Then it returns back to btrfs_reada_add() and calls
   reada_start_machine(). This queues a job in the readahead work queue to
   run the function reada_start_machine_worker(), which calls
   __reada_start_machine().

   At __reada_start_machine() we take the device list mutex and for each
   device found in the current device list, we call
   reada_start_machine_dev() to start the readahead work. However at this
   point the device /dev/sdd was already freed and is not in the device
   list anymore.

   This means the corresponding readahead for the extent at 38928384 is
   never started, and therefore the "->elems" counter of the reada_control
   structure allocated at btrfs_reada_add() never goes down to 0, causing
   the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
   and the corresponding btrfs_device structure is freed. However the
   readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
   current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
   device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
   __readahead_hook() which ends up dropping the last reference on the
   readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
   and attempts to delete an element from the device's 'reada_extents'
   radix tree, resulting in a use-after-free, as the device pointer of the
   zone for /dev/sdd is now stale. We can also access the device after
   dropping the last reference of a zone, through reada_zone_release(),
   also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:27 +01:00
Anand Jain
223b462744 btrfs: improve device scanning messages
commit 79dae17d8d44b2d15779e332180080af45df5352 upstream.

Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.

The current message is too long, so drop the unnecessary UUID and add
process name and PID.

While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent

CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05 11:43:26 +01:00
Qu Wenruo
1f90600e25 btrfs: Ensure we trim ranges across block group boundary
commit 6b7faadd985c990324b5b5bd18cc4ba5c395eb65 upstream.

[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type:		bbio->map_type, in above case, it's SINGLE DATA.
start:		Logical address of this trim
len:		Logical length of this trim
trimmed:	Physically trimmed bytes
ratio:		trimmed / len

Thus leaving some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:

      |<---    BG1    --->|<---      BG2     --->|
      |//////|<--   Range to discard   --->|/////|

To discard that range, we have the following calls:

  btrfs_discard_extent()
  |- btrfs_map_block()
  |  Returned bbio will end at BG1's end. As btrfs_map_block()
  |  never returns result across block group boundary.
  |- btrfs_issuse_discard()
     Issue discard for each stripe.

So we will only discard the range in BG1, not the remaining part in BG2.

Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
  btrfs_map_block() uses its @length paramter to notify the caller how
  many bytes are mapped in current call.
  With __btrfs_map_block_for_discard() also modifing the @length,
  btrfs_discard_extent() now understands when to do extra trim.

- Call btrfs_map_block() in a loop until we hit the range end Since we
  now know how many bytes are mapped each time, we can iterate through
  each block group boundary and issue correct trim for each range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 10:33:00 +02:00
Qu Wenruo
6a0f5da2db btrfs: volumes: Use more straightforward way to calculate map length
commit 2d974619a77f106f3d1341686dea95c0eaad601f upstream.

The old code goes:

 	offset = logical - em->start;
	length = min_t(u64, em->len - offset, length);

Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:

 	offset = logical - em->start;
	length = min_t(u64, em->start + em->len - logical, length);

Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 10:32:59 +02:00
Josef Bacik
524f3f3030 btrfs: fix lockdep splat in add_missing_dev
commit fccc0007b8dc952c6bc0805cdf842eb8ea06a639 upstream.

Nikolay reported a lockdep splat in generic/476 that I could reproduce
with btrfs/187.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc2+ #1 Tainted: G        W
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc_trace+0x3a/0x1a0
	 btrfs_alloc_device+0x43/0x210
	 add_missing_dev+0x20/0x90
	 read_one_chunk+0x301/0x430
	 btrfs_read_sys_array+0x17b/0x1b0
	 open_ctree+0xa62/0x1896
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x379
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_lookup_inode+0x2a/0x8f
	 __btrfs_update_delayed_inode+0x80/0x240
	 btrfs_commit_inode_delayed_inode+0x119/0x120
	 btrfs_evict_inode+0x357/0x500
	 evict+0xcf/0x1f0
	 vfs_rmdir.part.0+0x149/0x160
	 do_rmdir+0x136/0x1a0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1184/0x1fa0
	 lock_acquire+0xa4/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(&fs_info->chunk_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 1 PID: 100 Comm: kswapd0 Tainted: G        W         5.9.0-rc2+ #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x92/0xc8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x1184/0x1fa0
   lock_acquire+0xa4/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa4/0x3d0
   ? btrfs_evict_inode+0x11e/0x500
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x46/0x60
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This is because we are holding the chunk_mutex when we call
btrfs_alloc_device, which does a GFP_KERNEL allocation.  We don't want
to switch that to a GFP_NOFS lock because this is the only place where
it matters.  So instead use memalloc_nofs_save() around the allocation
in order to avoid the lockdep splat.

Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-17 13:47:51 +02:00
Josef Bacik
eb29635ba6 btrfs: drop path before adding new uuid tree entry
commit 9771a5cf937129307d9f58922d60484d58ababe7 upstream.

With the conversion of the tree locks to rwsem I got the following
lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
  ------------------------------------------------------
  btrfs-uuid/7955 is trying to acquire lock:
  ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  but task is already holding lock:
  ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-uuid-00){++++}-{3:3}:
	 down_read_nested+0x3e/0x140
	 __btrfs_tree_read_lock+0x39/0x180
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x4bd/0x990
	 btrfs_uuid_tree_add+0x89/0x2d0
	 btrfs_uuid_scan_kthread+0x330/0x390
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 down_read_nested+0x3e/0x140
	 __btrfs_tree_read_lock+0x39/0x180
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x4bd/0x990
	 btrfs_find_root+0x45/0x1b0
	 btrfs_read_tree_root+0x61/0x100
	 btrfs_get_root_ref.part.50+0x143/0x630
	 btrfs_uuid_tree_iterate+0x207/0x314
	 btrfs_uuid_rescan_kthread+0x12/0x50
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-uuid-00);
				 lock(btrfs-root-00);
				 lock(btrfs-uuid-00);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  1 lock held by btrfs-uuid/7955:
   #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  stack backtrace:
  CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? __btrfs_tree_read_lock+0x39/0x180
   ? btrfs_root_node+0x1c/0x1d0
   down_read_nested+0x3e/0x140
   ? __btrfs_tree_read_lock+0x39/0x180
   __btrfs_tree_read_lock+0x39/0x180
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x4bd/0x990
   btrfs_find_root+0x45/0x1b0
   btrfs_read_tree_root+0x61/0x100
   btrfs_get_root_ref.part.50+0x143/0x630
   btrfs_uuid_tree_iterate+0x207/0x314
   ? btree_readpage+0x20/0x20
   btrfs_uuid_rescan_kthread+0x12/0x50
   kthread+0x133/0x150
   ? kthread_create_on_node+0x60/0x60
   ret_from_fork+0x1f/0x30

This problem exists because we have two different rescan threads,
btrfs_uuid_scan_kthread which creates the uuid tree, and
btrfs_uuid_tree_iterate that goes through and updates or deletes any out
of date roots.  The problem is they both do things in different order.
btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
then does a btrfs_get_fs_root() which can read from the tree_root.

It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop.  So simply move
the path release before we add our entry to the uuid tree.

This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has it's own problems.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09 19:12:30 +02:00
Qu Wenruo
38ab14b1e2 btrfs: relocation: review the call sites which can be interrupted by signal
commit 44d354abf33e92a5e73b965c84caf5a5d5e58a0b upstream.

Since most metadata reservation calls can return -EINTR when get
interrupted by fatal signal, we need to review the all the metadata
reservation call sites.

In relocation code, the metadata reservation happens in the following
sites:

- btrfs_block_rsv_refill() in merge_reloc_root()
  merge_reloc_root() is a pretty critical section, we don't want to be
  interrupted by signal, so change the flush status to
  BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
  Since such change can be ENPSPC-prone, also shrink the amount of
  metadata to reserve least amount avoid deadly ENOSPC there.

- btrfs_block_rsv_refill() in reserve_metadata_space()
  It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
  by signal.

- btrfs_block_rsv_refill() in prepare_to_relocate()

- btrfs_block_rsv_add() in prepare_to_relocate()

- btrfs_block_rsv_refill() in relocate_block_group()

- btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()

- btrfs_start_transaction() in relocate_block_group()

- btrfs_start_transaction() in create_reloc_inode()
  Can be interrupted by fatal signal and we can handle it easily.
  For these call sites, just catch the -EINTR value in btrfs_balance()
  and count them as canceled.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21 13:05:22 +02:00
Josef Bacik
ae3f93cafd btrfs: move the chunk_mutex in btrfs_read_chunk_tree
commit 01d01caf19ff7c537527d352d169c4368375c0a1 upstream.

We are currently getting this lockdep splat in btrfs/161:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc5+ #20 Tainted: G            E
  ------------------------------------------------------
  mount/678048 is trying to acquire lock:
  ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]

  but task is already holding lock:
  ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x8b/0x8f0
	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
	 ksys_ioctl+0x87/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1240/0x2460
	 lock_acquire+0xab/0x360
	 __mutex_lock+0x8b/0x8f0
	 clone_fs_devices+0x4d/0x170 [btrfs]
	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
	 open_ctree+0xb7c/0x18ce [btrfs]
	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 do_mount+0x7de/0xb30
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->chunk_mutex);
				 lock(&fs_devs->device_list_mutex);
				 lock(&fs_info->chunk_mutex);
    lock(&fs_devs->device_list_mutex);

   *** DEADLOCK ***

  3 locks held by mount/678048:
   #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
   #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

  stack backtrace:
  CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
  Call Trace:
   dump_stack+0x96/0xd0
   check_noncircular+0x162/0x180
   __lock_acquire+0x1240/0x2460
   ? asm_sysvec_apic_timer_interrupt+0x12/0x20
   lock_acquire+0xab/0x360
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   __mutex_lock+0x8b/0x8f0
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   ? rcu_read_lock_sched_held+0x52/0x60
   ? cpumask_next+0x16/0x20
   ? module_assert_mutex_or_preempt+0x14/0x40
   ? __module_address+0x28/0xf0
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   ? static_obj+0x4f/0x60
   ? lockdep_init_map_waits+0x43/0x200
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   clone_fs_devices+0x4d/0x170 [btrfs]
   btrfs_read_chunk_tree+0x330/0x800 [btrfs]
   open_ctree+0xb7c/0x18ce [btrfs]
   ? super_setup_bdi_name+0x79/0xd0
   btrfs_mount_root.cold+0x13/0xfa [btrfs]
   ? vfs_parse_fs_string+0x84/0xb0
   ? rcu_read_lock_sched_held+0x52/0x60
   ? kfree+0x2b5/0x310
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   fc_mount+0xe/0x40
   vfs_kern_mount.part.0+0x71/0x90
   btrfs_mount+0x13b/0x3e0 [btrfs]
   ? cred_has_capability+0x7c/0x120
   ? rcu_read_lock_sched_held+0x52/0x60
   ? legacy_get_tree+0x30/0x50
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   do_mount+0x7de/0xb30
   ? memdup_user+0x4e/0x90
   __x64_sys_mount+0x8e/0xd0
   do_syscall_64+0x52/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex.  The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem.  We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.

An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21 13:05:22 +02:00
Josef Bacik
98f55cd496 btrfs: open device without device_list_mutex
commit 18c850fdc5a801bad4977b0f1723761d42267e45 upstream.

There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.

======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]

but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

 -> #6 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x13e/0x220
       btrfs_page_mkwrite+0x59/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

 -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x60/0x80
       _copy_from_user+0x20/0xb0
       get_sg_io_hdr+0x9a/0xb0
       scsi_cmd_ioctl+0x1ea/0x2f0
       cdrom_ioctl+0x3c/0x12b4
       sr_block_ioctl+0xa4/0xd0
       block_ioctl+0x3f/0x50
       ksys_ioctl+0x82/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #4 (&cd->lock){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       sr_block_open+0xa2/0x180
       __blkdev_get+0xdd/0x550
       blkdev_get+0x38/0x150
       do_dentry_open+0x16b/0x3e0
       path_openat+0x3c9/0xa00
       do_filp_open+0x75/0x100
       do_sys_openat2+0x8a/0x140
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       __blkdev_get+0x6a/0x550
       blkdev_get+0x85/0x150
       blkdev_get_by_path+0x2c/0x70
       btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
       open_fs_devices+0x88/0x240 [btrfs]
       btrfs_open_devices+0x92/0xa0 [btrfs]
       btrfs_mount_root+0x250/0x490 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       vfs_kern_mount.part.0+0x71/0xb0
       btrfs_mount+0x119/0x380 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       do_mount+0x8c6/0xca0
       __x64_sys_mount+0x8e/0xd0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_run_dev_stats+0x36/0x420 [btrfs]
       commit_cowonly_roots+0x91/0x2d0 [btrfs]
       btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       __mutex_lock+0x7b/0x820
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

Possible unsafe locking scenario:

     CPU0                    CPU1
     ----                    ----
 lock(sb_pagefaults);
                             lock(&mm->mmap_lock#2);
                             lock(sb_pagefaults);
 lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by systemd-journal/509:
 #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
 #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]

stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x92/0xc8
 check_noncircular+0x134/0x150
 __lock_acquire+0x1241/0x20c0
 lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 __mutex_lock+0x7b/0x820
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? kvm_sched_clock_read+0x14/0x30
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0xc/0xb0
 btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 start_transaction+0xd2/0x500 [btrfs]
 btrfs_dirty_inode+0x44/0xd0 [btrfs]
 file_update_time+0xc6/0x120
 btrfs_page_mkwrite+0xda/0x560 [btrfs]
 ? sched_clock+0x5/0x10
 do_page_mkwrite+0x4f/0x130
 do_wp_page+0x3b0/0x4f0
 handle_mm_fault+0xf47/0x1850
 do_user_addr_fault+0x1fc/0x4b0
 exc_page_fault+0x88/0x300
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.

Fix this by not holding the ->device_list_mutex at this point.  The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.

However it can also be modified by doing a scan on a device.  But this
action is specifically protected by the uuid_mutex, which we are holding
here.  We cannot race with opening at this point because we have the
->s_mount lock held during the mount.  Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21 13:05:21 +02:00
David Sterba
207659ef15 btrfs: allow use of global block reserve for balance item deletion
commit 3502a8c0dc1bd4b4970b59b06e348f22a1c05581 upstream.

On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:

[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left

It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.

Status of the filesystem (100GiB) just after the balance fails:

$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21 13:05:20 +02:00
Boris Burkov
b04805a7e8 btrfs: fix mount failure caused by race with umount
commit 48cfa61b58a1fee0bc49eef04f8ccf31493b7cdd upstream.

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                              btrfs_mount_root
                                                btrfs_scan_one_device
                                                fs_devices = device->fs_devices
                                                fs_info->fs_devices = fs_devices
                                                // fs_devices-opened makes this a no-op
                                                btrfs_open_devices(fs_devices, mode, fs_type)
                                                s = sget(fs_type, test, set, flags, fs_info);
                                                  find sb in s_instances
                                                  grab_super(sb);
                                                    down_write(&s->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
                                                    return 0;
                                                  retry lookup
                                                  don't find sb in s_instances (deleted by CPU0)
                                                  s = alloc_super
                                                  return s;
                                                btrfs_fill_super(s, fs_devices, data)
                                                  open_ctree // fs_devices total_rw_bytes improperly set!
                                                    btrfs_read_chunk_tree
                                                      read_one_dev // increment total_rw_bytes again!!
                                                      super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

  for i in $(seq 0 500)
  do
    dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
  done
  umount /mnt/foo&
  mount /mnt/foo

does the trick for me.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:30 +02:00
Anand Jain
20f260ed53 btrfs: include non-missing as a qualifier for the latest_bdev
commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream.

btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number.  For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.

But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device  generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].

[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid

Consider a working filesystem with raid1:

  $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
  $ mount /dev/sda /mnt-raid1
  $ umount /mnt-raid1

While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt-single
  $ btrfs dev add -f /dev/sda /mnt-single

Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.

  $ mount -o degraded /dev/sdb /mnt-raid1

[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

  kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
  kernel: BTRFS error (device sdb): failed to read devices
  kernel: BTRFS error (device sdb): open_ctree failed

Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22 09:31:12 +02:00
Anand Jain
fd9720b8e9 btrfs: free alien device after device add
commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream.

When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.

By having an alien device in fs_devices, we have two issues so far

1. missing device does not not show as missing in the userland

2. degraded mount will fail

Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).

A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.

And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.

This patch fixes both issues above by removing the alien device.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22 09:31:12 +02:00
Nikolay Borisov
714cd4a512 btrfs: Fix split-brain handling when changing FSID to metadata uuid
[ Upstream commit 1362089d2ad7e20d16371b39d3c11990d4ec23e4 ]

Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.

In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.

This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.

Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:56 +01:00
Anand Jain
7303a0b0a5 btrfs: device stats, log when stats are zeroed
[ Upstream commit a69976bc69308aa475d0ba3b8b3efd1d013c0460 ]

We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.

For example:
 BTRFS info (device sdc): device stats zeroed by btrfs(9223)

Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:54 +01:00
Nikolay Borisov
f3107a3c9b btrfs: Handle another split brain scenario with metadata uuid feature
commit 05840710149c7d1a78ea85a2db5723f706e97d8f upstream.

There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.

This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.

Fixes: cc5de4e702 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11 04:35:28 -08:00
Josef Bacik
2f7050c2b2 btrfs: check rw_devices, not num_devices for balance
commit b35cf1f0bf1f2b0b193093338414b9bd63b29015 upstream.

The fstest btrfs/154 reports

  [ 8675.381709] BTRFS: Transaction aborted (error -28)
  [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
  [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
  [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
  [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
  [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
  [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
  [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
  [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
  [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
  [ 8675.413994] FS:  00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
  [ 8675.416146] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
  [ 8675.419801] Call Trace:
  [ 8675.420742]  btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
  [ 8675.422600]  btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
  [ 8675.424335]  reset_balance_state+0x14a/0x190 [btrfs]
  [ 8675.425824]  btrfs_balance.cold+0xe7/0x154 [btrfs]
  [ 8675.427313]  ? kmem_cache_alloc_trace+0x235/0x2c0
  [ 8675.428663]  btrfs_ioctl_balance+0x298/0x350 [btrfs]
  [ 8675.430285]  btrfs_ioctl+0x466/0x2550 [btrfs]
  [ 8675.431788]  ? mem_cgroup_charge_statistics+0x51/0xf0
  [ 8675.433487]  ? mem_cgroup_commit_charge+0x56/0x400
  [ 8675.435122]  ? do_raw_spin_unlock+0x4b/0xc0
  [ 8675.436618]  ? _raw_spin_unlock+0x1f/0x30
  [ 8675.438093]  ? __handle_mm_fault+0x499/0x740
  [ 8675.439619]  ? do_vfs_ioctl+0x56e/0x770
  [ 8675.441034]  do_vfs_ioctl+0x56e/0x770
  [ 8675.442411]  ksys_ioctl+0x3a/0x70
  [ 8675.443718]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [ 8675.445333]  __x64_sys_ioctl+0x16/0x20
  [ 8675.446705]  do_syscall_64+0x50/0x210
  [ 8675.448059]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left

We now use btrfs_can_overcommit() to see if we can flip a block group
read only.  Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks.  With my
patches we were allowed to do the balance, which is technically correct.

The test is trying to start balance on degraded mount.  So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage.  This
results in an ENOSPC.

But we shouldn't even be making it this far, we don't have enough
devices to restripe.  The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.

The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.

Fixes: e4d8ec0f65 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:22:41 +01:00
Omar Sandoval
2bae3ee327 btrfs: get rid of unique workqueue helper functions
[ Upstream commit a0cac0ec961f0d42828eeef196ac2246a2f07659 ]

Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.

However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.

While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09 10:20:06 +01:00
Qu Wenruo
c17add7a1c btrfs: Consider system chunk array size for new SYSTEM chunks
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.

The extra limit was removed in a refactoring, so add it back.

Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 19:11:34 +02:00
Zygo Blaxell
7a54789074 btrfs: fix balance convert to single on 32-bit host CPUs
Currently, the command:

	btrfs balance start -dconvert=single,soft .

on a Raspberry Pi produces the following kernel message:

	BTRFS error (device mmcblk0p2): balance: invalid convert data profile single

This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.

Fix by open-coding the check using u64 variables.

Tested by completing the original balance command on several Raspberry
Pis.

Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01 19:37:29 +02:00
Qu Wenruo
fab2735955 btrfs: Fix a regression which we can't convert to SINGLE profile
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:

  # btrfs balance start -f -dconvert=single $mnt
  ERROR: error during balancing '/mnt/btrfs': Invalid argument
  # dmesg -t | tail
  validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
  BTRFS error (device dm-3): balance: invalid convert data profile single

[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.

Thus we fail at that (profile & ~allowed) check.

This regression is caused by commit 081db89b13 ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.

[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13 ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-25 16:00:37 +02:00
David Sterba
1dc990dfd3 btrfs: move dev_stats helpers to volumes.c
The other dev stats functions are already there and the helpers are not
used by anything else.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:16 +02:00
David Sterba
784352fe0b btrfs: move math functions to misc.h
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:15 +02:00
Anand Jain
d2979aa25f btrfs: use proper error values on allocation failure in clone_fs_devices
Fix the fake ENOMEM return error code to the actual error in
clone_fs_devices().

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:14 +02:00
Anand Jain
a06dee4d7e btrfs: proper error handling when invalid device is found in find_next_devid
In a corrupted tree, if search for next devid finds the device with
devid = -1, then report the error -EUCLEAN back to the parent function
to fail gracefully.

The tree checker will not catch this in case the devids are created
using the following script:

  umount /btrfs
  dev1=/dev/sdb
  dev2=/dev/sdc
  mkfs.btrfs -fq -dsingle -msingle $dev1
  mount $dev1 /btrfs

  _fail()
  {
	  echo $1
	  exit 1
  }

  while true; do
	  btrfs dev add -f $dev2 /btrfs || _fail "add failed"
	  btrfs dev del $dev1 /btrfs || _fail "del failed"
	  dev_tmp=$dev1
	  dev1=$dev2
	  dev2=$dev_tmp
  done

With output:

  BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
  BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
  BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
  BTRFS warning (device sdb): Skipping commit of aborted transaction.
  BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ add script and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:14 +02:00
Dan Carpenter
f11369897e btrfs: fix error pointer check in __btrfs_map_block()
The btrfs_get_chunk_map() never returns NULL, it returns error pointers.

Fixes: 89b798ad1b ("btrfs: Use btrfs_get_io_geometry appropriately")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:12 +02:00
Anand Jain
3b80a984d2 btrfs: dev stat drop useless goto
In the function btrfs_init_dev_stats() goto out is not needed, because the
alloc has failed. So just return -ENOMEM.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:12 +02:00
Anand Jain
440630ea7c btrfs: dev stats item key conversion per cpu type is not needed
%found_key is not used, drop it since it hasn't been used since the
beginning in 733f4fbbc1 ("Btrfs: read device stats on mount, write
modified ones during commit").

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:12 +02:00
David Sterba
f93c39970b btrfs: factor out sysfs code for updating sprout fsid
Wrap the fsid renaming code and move it to sysfs.c.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:07 +02:00
David Sterba
5b28692e0c btrfs: factor out sysfs code for sending device uevent
The device uevent belongs to the sysfs API.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:07 +02:00
Anand Jain
ae4b9b4c7d btrfs: opencode reset of all device stats
__btrfs_reset_dev_stats() is a small helper function to reset devices stat
values, and is used only once, instead just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:06 +02:00
Anand Jain
4e411a7d04 btrfs: reset device stat using btrfs_dev_stat_set
btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:06 +02:00
Josef Bacik
aac0023c21 btrfs: move basic block_group definitions to their own header
This is prep work for moving all of the block group cache code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:03 +02:00
Qu Wenruo
112974d406 btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate()
[BUG]
Test case btrfs/156 fails since commit 302167c50b ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.

[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().

This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate

Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
  used space, we can relocate

This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.

btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.

So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M      129M        257M       385M      513M       550M
|///////|///////////|//////////|         |          |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.

1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M      129M        257M       385M      513M       550M
|///////|           |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)

However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M      129M        257M       385M      513M       550M
|///////|///////////|//////////|         |          |

2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.

The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.

The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.

However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.

[FIX]
The fix here is to remove btrfs_can_relocate() completely.

If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.

If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:01 +02:00
Qu Wenruo
135da9766e btrfs: volumes: Add comment for find_free_dev_extent_start()
Since commit 6df9a95e63 ("Btrfs: make the chunk allocator completely
tree lockless") we search commit root of device tree to avoid deadlock.

This introduced a safety feature, find_free_dev_extent_start() won't
use dev extents which just get freed in current transaction.

This safety feature makes sure we won't allocate new block group using
just freed dev extents to break CoW.

However, this feature also makes find_free_dev_extent_start() not
reliable reporting free device space.  Just add such comment to make
later viewer careful about this behavior.

This behavior makes one caller, btrfs_can_relocate() unreliable
determining the device free space.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:00 +02:00
Qu Wenruo
9e3246a5f6 btrfs: volumes: Unexport find_free_dev_extent_start()
This function is only used locally in find_free_dev_extent(), no
external callers.

So unexport it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:00 +02:00
YueHaibing
99fccf33c2 btrfs: remove set but not used variable 'offset'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/btrfs/volumes.c: In function __btrfs_map_block:
fs/btrfs/volumes.c:6023:6: warning:
 variable offset set but not used [-Wunused-but-set-variable]

It is not used any more since commit 343abd1c0ca9 ("btrfs: Use
btrfs_get_io_geometry appropriately")

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:58:58 +02:00
Filipe Manana
d7cd4dd907 Btrfs: fix sysfs warning and missing raid sysfs directories
In the 5.3 merge window, commit 7c7e301406 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:

  [116648.059212] kernfs: can not remove 'total_bytes', no directory
  [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
  (...)
  [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
  (...)
  [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
  (...)
  [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
  [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
  [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
  [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
  [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
  [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
  [116648.077143] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
  [116648.077852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
  [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [116648.080585] Call Trace:
  [116648.081262]  remove_files+0x31/0x70
  [116648.081929]  sysfs_remove_group+0x38/0x80
  [116648.082596]  sysfs_remove_groups+0x34/0x70
  [116648.083258]  kobject_del+0x20/0x60
  [116648.083933]  btrfs_free_block_groups+0x405/0x430 [btrfs]
  [116648.084608]  close_ctree+0x19a/0x380 [btrfs]
  [116648.085278]  generic_shutdown_super+0x6c/0x110
  [116648.085951]  kill_anon_super+0xe/0x30
  [116648.086621]  btrfs_kill_super+0x12/0xa0 [btrfs]
  [116648.087289]  deactivate_locked_super+0x3a/0x70
  [116648.087956]  cleanup_mnt+0xb4/0x160
  [116648.088620]  task_work_run+0x7e/0xc0
  [116648.089285]  exit_to_usermode_loop+0xfa/0x100
  [116648.089933]  do_syscall_64+0x1cb/0x220
  [116648.090567]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [116648.091197] RIP: 0033:0x7f9cdc073b37
  (...)
  [116648.100046] ---[ end trace 22e24db328ccadf8 ]---
  [116648.100618] ------------[ cut here ]------------
  [116648.101175] kernfs: can not remove 'used_bytes', no directory
  [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
  (...)
  [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
  (...)
  [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
  (...)
  [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
  [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
  [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
  [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
  [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
  [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
  [116648.113268] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
  [116648.113939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
  [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [116648.116649] Call Trace:
  [116648.117326]  remove_files+0x31/0x70
  [116648.117997]  sysfs_remove_group+0x38/0x80
  [116648.118671]  sysfs_remove_groups+0x34/0x70
  [116648.119342]  kobject_del+0x20/0x60
  [116648.120022]  btrfs_free_block_groups+0x405/0x430 [btrfs]
  [116648.120707]  close_ctree+0x19a/0x380 [btrfs]
  [116648.121396]  generic_shutdown_super+0x6c/0x110
  [116648.122057]  kill_anon_super+0xe/0x30
  [116648.122702]  btrfs_kill_super+0x12/0xa0 [btrfs]
  [116648.123335]  deactivate_locked_super+0x3a/0x70
  [116648.123961]  cleanup_mnt+0xb4/0x160
  [116648.124586]  task_work_run+0x7e/0xc0
  [116648.125210]  exit_to_usermode_loop+0xfa/0x100
  [116648.125830]  do_syscall_64+0x1cb/0x220
  [116648.126463]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [116648.127080] RIP: 0033:0x7f9cdc073b37
  (...)
  [116648.135923] ---[ end trace 22e24db328ccadf9 ]---

These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().

We have this split raid kobject setup since commit 75cb379d26
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:

1) During device replace (or scrub) after adding a device a to the
   filesystem. The replace procedure (and scrub) do calls to
   btrfs_inc_block_group_ro() which can allocate a new block group
   with a new raid profile (because we now have more devices). This
   can be triggered by test cases btrfs/027 and btrfs/176.

2) During a degraded mount trough any write path. This can be triggered
   by test case btrfs/124.

Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:

1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
   anywhere in the replace/scrub path will cause a deadlock with reclaim
   because if reclaim happens and a transaction commit is triggered,
   the transaction commit path will block at btrfs_scrub_pause().

2) During degraded mounts it is essentially impossible to figure out where
   to add extra calls to btrfs_add_raid_kobjects(), because allocation of
   a block group with a new raid profile can happen anywhere, which means
   we can't safely figure out which contextes are safe for reclaim, as
   we can either hold a transaction handle or some lock needed by the
   transaction commit path.

So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().

Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).

Fixes: 75cb379d26 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-07 16:25:44 +02:00
Johannes Thumshirn
373c3b80e4 btrfs: don't leak extent_map in btrfs_get_io_geometry()
btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference
on a extent_map, but on normal operation it does not drop this reference
anymore.

This leads to excessive kmemleak reports.

Always call free_extent_map(), not just in the error case.

Fixes: 5f1411265e ("btrfs: Introduce btrfs_io_geometry infrastructure")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-17 17:03:36 +02:00
Josef Bacik
8719aaae8d btrfs: move space_info to space-info.h
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
Nikolay Borisov
89b798ad1b btrfs: Use btrfs_get_io_geometry appropriately
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.

This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00