Linux kernel source tree for SHARP Brain series (PW-SH1 or later)
Go to file
Filipe Manana bf4a9715a9 Btrfs: fix race between using extent maps and merging them
commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream.

We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
	    is pinned since there's writeback and an ordered extent in
	    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19 19:53:00 +01:00
arch arm64: nofpsmid: Handle TIF_FOREIGN_FPSTATE flag cleanly 2020-02-14 16:34:18 -05:00
block block: fix memleak of bio integrity data 2020-01-26 10:01:09 +01:00
certs PKCS#7: Refactor verify_pkcs7_signature() 2019-08-05 18:40:18 -04:00
crypto crypto: testmgr - don't try to decrypt uninitialized buffers 2020-02-14 16:34:18 -05:00
Documentation dt-bindings: iio: adc: ad7606: Fix wrong maxItems value 2020-02-14 16:34:19 -05:00
drivers ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system 2020-02-19 19:52:57 +01:00
fs Btrfs: fix race between using extent maps and merging them 2020-02-19 19:53:00 +01:00
include ACPICA: Introduce acpi_any_gpe_status_set() 2020-02-19 19:52:57 +01:00
init Revert "um: Enable CONFIG_CONSTRUCTORS" 2020-02-01 09:34:53 +00:00
ipc ipc/msg.c: consolidate all xxxctl_down() functions 2020-02-11 04:35:07 -08:00
kernel ACPI: PM: s2idle: Avoid possible race related to the EC GPE 2020-02-19 19:52:56 +01:00
lib lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() 2020-02-11 04:35:14 -08:00
LICENSES LICENSES: Rename other to deprecated 2019-05-03 06:34:32 -06:00
mm mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush 2020-02-11 04:35:42 -08:00
net bpf, sockmap: Check update requirements after locking 2020-02-14 16:34:10 -05:00
samples samples/bpf: Xdp_redirect_cpu fix missing tracepoint attach 2020-02-11 04:35:29 -08:00
scripts scripts/find-unused-docs: Fix massive false positives 2020-02-11 04:35:23 -08:00
security selinux: fall back to ref-walk if audit is required 2020-02-14 16:34:20 -05:00
sound ALSA: usb-audio: Add clock validity quirk for Denon MC7000/MCX8000 2020-02-19 19:52:57 +01:00
tools tools/power/acpi: fix compilation error 2020-02-14 16:34:15 -05:00
usr gen_initramfs_list.sh: fix 'bad variable name' error 2020-01-09 10:20:00 +01:00
virt KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer 2020-02-14 16:34:18 -05:00
.clang-format clang-format: Update with the latest for_each macro list 2019-08-31 10:00:51 +02:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore Modules updates for v5.4 2019-09-22 10:34:46 -07:00
.mailmap ARM: SoC fixes 2019-11-10 13:41:59 -08:00
COPYING COPYING: use the new text with points to the license files 2018-03-23 12:41:45 -06:00
CREDITS MAINTAINERS: Remove Simon as Renesas SoC Co-Maintainer 2019-10-10 08:12:51 -07:00
Kbuild kbuild: do not descend to ./Kbuild when cleaning 2019-08-21 21:03:58 +09:00
Kconfig docs: kbuild: convert docs to ReST and rename to *.rst 2019-06-14 14:21:21 -06:00
MAINTAINERS MAINTAINERS: correct entries for ISDN/mISDN section 2020-02-11 04:35:06 -08:00
Makefile Linux 5.4.20 2020-02-14 16:34:20 -05:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.