  1. Jul 29, 2024
    • btrfs: fix corruption after buffer fault in during direct IO append write · 939b656b
      Filipe Manana authored
      During an append (O_APPEND write flag) direct IO write, if the input buffer
      was not previously faulted in, we can corrupt the file in a way that the
      final size is unexpected and it includes an unexpected hole.
      
      The problem happens like this:
      
      1) We have an empty file, with size 0, for example;
      
      2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
         buffer is not currently faulted in;
      
      3) We enter btrfs_direct_write(), lock the inode and call
         generic_write_checks(), which calls generic_write_checks_count(), and
         that function sets the iocb position to 0 with the following code:
      
      	if (iocb->ki_flags & IOCB_APPEND)
      		iocb->ki_pos = i_size_read(inode);
      
      4) We call btrfs_dio_write() and enter into iomap, which will end up
         calling btrfs_dio_iomap_begin() and that calls
         btrfs_get_blocks_direct_write(), where we update the i_size of the
         inode to 4096 bytes;
      
      5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
         the page of the write input buffer (at iomap_dio_bio_iter(), with a
         call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
         returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
      
      6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
         fault in the write buffer and then jump to the 'relock' label;
      
      7) We lock the inode again, redo all the necessary checks and call
         generic_write_checks() again, which calls generic_write_checks_count()
         again, and there we set the iocb's position to 4K, which is the current
         i_size of the inode, with the code shown above:
      
              if (iocb->ki_flags & IOCB_APPEND)
                      iocb->ki_pos = i_size_read(inode);
      
      8) Then we go again to btrfs_dio_write() and enter iomap and the write
         succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
         the [0, 4K) range and an i_size of 8K, which goes against the
         expectations of having the data written to the range [0, 4K) and get an
         i_size of 4K.
      
      Fix this by not unlocking the inode before faulting in the input buffer,
      in case we get -EFAULT or an incomplete write, and not jumping to the
      'relock' label after faulting in the buffer - instead jump to a location
      immediately before calling iomap, skipping all the write checks and
      relocking. This solves this problem and it's fine even in case the input
      buffer is memory mapped to the same file range, since only holding the
      range locked in the inode's io tree can cause a deadlock; it's safe to
      keep the inode lock (VFS lock), as was fixed and described in commit
      51bd9563 ("btrfs: fix deadlock due to page faults during direct IO
      reads and writes").
      
      A sample reproducer provided by a reporter is the following:
      
         $ cat test.c
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <fcntl.h>
         #include <stdio.h>
         #include <sys/mman.h>
         #include <sys/stat.h>
         #include <unistd.h>
      
         int main(int argc, char *argv[])
         {
             if (argc < 2) {
                 fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
                 return 1;
             }
      
             int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                           O_APPEND, 0644);
             if (fd < 0) {
                 perror("creating test file");
                 return 1;
             }
      
             char *buf = mmap(NULL, 4096, PROT_READ,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
             ssize_t ret = write(fd, buf, 4096);
             if (ret < 0) {
                 perror("pwritev2");
                 return 1;
             }
      
             struct stat stbuf;
             ret = fstat(fd, &stbuf);
             if (ret < 0) {
                 perror("stat");
                 return 1;
             }
      
             printf("size: %llu\n", (unsigned long long)stbuf.st_size);
             return stbuf.st_size == 4096 ? 0 : 1;
         }
      
      A test case for fstests will be sent soon.
      
      Reported-by: Hanna Czenczek <hreitz@redhat.com>
      Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
      Fixes: 8184620a ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
      CC: stable@vger.kernel.org # 6.1+
      Tested-by: Hanna Czenczek <hreitz@redhat.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      939b656b
    • btrfs: zoned: fix zone_unusable accounting on making block group read-write again · 8cd44dd1
      Naohiro Aota authored
      When btrfs makes a block group read-only, it adds all free regions in the
      block group to space_info->bytes_readonly. That free space excludes
      reserved and pinned regions. OTOH, when btrfs makes the block group
      read-write again, it moves all the unused regions into the block group's
      zone_unusable. That unused region includes reserved and pinned regions.
      As a result, it counts too much zone_unusable bytes.
      
      Fortunately (or unfortunately), having erroneous zone_unusable does not
      affect the calculation of space_info->bytes_readonly, because free
      space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
      the erroneous zone_unusable and it reduces the num_bytes just to cancel the
      error.
      
      This behavior can be easily discovered by adding a WARN_ON to check, e.g.,
      "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running an fstests test
      case like btrfs/282.
      
      Fix it by properly considering pinned and reserved in
      btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
      btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
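
      For illustration, a minimal sketch of the corrected accounting with
      hypothetical local names (not the exact kernel diff):

         /* Only space that is neither used, reserved nor pinned may move
          * back to zone_unusable when the group becomes read-write again. */
         u64 unusable = (alloc_offset - used - reserved - pinned) +
                        (length - zone_capacity);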
      
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8cd44dd1
    • btrfs: do not subtract delalloc from avail bytes · d89c285d
      Naohiro Aota authored
      The block group's avail bytes printed when dumping a space info subtract
      the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
      btrfs_free_reserved_bytes(), it is added or subtracted along with
      "reserved" for the delalloc case, which means the "delalloc_bytes" is a
      part of the "reserved" bytes. So, excluding it to calculate the avail space
      counts delalloc_bytes twice, which can lead to an invalid result.
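
      For illustration, the double count with hypothetical names (not the
      kernel code):

         /* delalloc_bytes is already carried inside "reserved", so also
          * subtracting it removes the same bytes twice from "avail". */
         u64 avail_wrong = length - used - pinned - reserved - delalloc_bytes;
         u64 avail       = length - used - pinned - reserved;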
      
      Fixes: e50b122b ("btrfs: print available space for a block group when dumping a space info")
      CC: stable@vger.kernel.org # 6.6+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d89c285d
    • btrfs: make cow_file_range_inline() honor locked_page on error · 47857437
      Boris Burkov authored
      
      The btrfs buffered write path runs through __extent_writepage() which
      has some tricky return value handling for writepage_delalloc().
      Specifically, when that returns 1, we exit, but for other return values
      we continue and end up calling btrfs_folio_end_all_writers(). If the
      folio has been unlocked (note that we check the PageLocked bit at the
      start of __extent_writepage()), this results in an assert panic like
      this one from syzbot:
      
        BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
        BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
        BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
        assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/subpage.c:871!
        Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
        CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
        6.10.0-syzkaller-05505-gb1bc554e009e #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 06/27/2024
        RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
        Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
        0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
        6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
        RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
        RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
        RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
        RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
        R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
        R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
        FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
        knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
        <TASK>
        __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
        extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
        btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
        do_writepages+0x359/0x870 mm/page-writeback.c:2656
        filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
        __filemap_fdatawrite_range mm/filemap.c:430 [inline]
        __filemap_fdatawrite mm/filemap.c:436 [inline]
        filemap_flush+0xdf/0x130 mm/filemap.c:463
        btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
        __fput+0x24a/0x8a0 fs/file_table.c:422
        task_work_run+0x24f/0x310 kernel/task_work.c:222
        exit_task_work include/linux/task_work.h:40 [inline]
        do_exit+0xa2f/0x27f0 kernel/exit.c:877
        do_group_exit+0x207/0x2c0 kernel/exit.c:1026
        __do_sys_exit_group kernel/exit.c:1037 [inline]
        __se_sys_exit_group kernel/exit.c:1035 [inline]
        __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
        x64_sys_call+0x2634/0x2640
        arch/x86/include/generated/asm/syscalls_64.h:232
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
        entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f5f075b70c9
        Code: Unable to access opcode bytes at
        0x7f5f075b709f.
      
      I was hitting the same issue by doing hundreds of accelerated runs of
      generic/475, which also hits IO errors by design.
      
      I instrumented that reproducer with bpftrace and found that the
      undesirable folio_unlock was coming from the following callstack:
      
        folio_unlock+5
        __process_pages_contig+475
        cow_file_range_inline.constprop.0+230
        cow_file_range+803
        btrfs_run_delalloc_range+566
        writepage_delalloc+332
        __extent_writepage # inlined in my stacktrace, but I added it here
        extent_write_cache_pages+622
      
      Looking at the bisected-to patch in the syzbot report, Josef realized
      that the logic of the cow_file_range_inline() error path had subtly
      changed. In the past, on error, it jumped to out_unlock in
      cow_file_range(), which honors the locked_page, so when we ultimately
      call folio_end_all_writers(), the folio of interest is still locked.
      After the change, we always unlock, ignoring the locked_page, on both
      success and error. On the success path, this all results in returning 1
      to __extent_writepage(), which skips the folio_end_all_writers() call,
      which makes it OK to have unlocked.
      
      Fix the bug by wiring the locked_page into cow_file_range_inline() and
      only setting locked_page to NULL on success.
      
      Reported-by: <syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com>
      Fixes: 0586d0a8 ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      47857437
    • xfs: convert comma to semicolon · 7bf888fa
      Chen Ni authored
      Replace a comma between expression statements by a semicolon.
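
      For illustration, the kind of change this makes (a generic sketch, not
      the actual xfs hunk):

         /* before: the comma operator accidentally chains two statements */
         a = 1, b = 2;

         /* after: each statement is terminated with a semicolon */
         a = 1;
         b = 2;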
      
      Fixes: 178b48d5 ("xfs: remove the for_each_xbitmap_ helpers")
      Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      7bf888fa
    • xfs: convert comma to semicolon · 8c2263b9
      Chen Ni authored
      
      Replace a comma between expression statements by a semicolon.
      
      Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
      Fixes: 8f4b980e ("xfs: pass the attr value to put_listent when possible")
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      8c2263b9
  2. Jul 25, 2024
    • btrfs: fix corrupt read due to bad offset of a compressed extent map · de9f46cb
      Filipe Manana authored
      If we attempt to insert a compressed extent map that has a range that
      overlaps another extent map we have in the inode's extent map tree, we
      can end up with an incorrect offset after adjusting the new extent map at
      merge_extent_mapping() because we don't update the extent map's offset.
      
      For example consider the following scenario:
      
      1) We have a file extent item for a compressed extent covering the file
         range [108K, 144K) and currently there's no corresponding extent map
         in the inode's extent map tree;
      
      2) The inode's size is 141K;
      
      3) We have an encoded write (compressed) into the file range [120K, 128K),
         which overlaps the existing file extent item. The encoded write creates
         a matching extent map, adds it to the inode's extent map tree and
         creates an ordered extent for it.
      
         Note that the corresponding file extent item is added to the subvolume
         tree only when the ordered extent completes (when executing
         btrfs_finish_one_ordered());
      
      4) We have a write into the file range [160K, 164K).
      
         This write increases the i_size of the file, and there's a hole
         between the current i_size (141K) and the start offset of this write,
         and since the old i_size is in the middle of the block [140K, 144K),
         we have to write zeroes to the range [141K, 144K) (3072 bytes) and
         therefore dirty that page.
      
         We then call btrfs_set_extent_delalloc() with a start offset of 140K.
         We then end up at btrfs_find_new_delalloc_bytes() which will call
         btrfs_get_extent() for the range [140K, 144K);
      
      5) btrfs_get_extent() doesn't find any extent map in the inode's
         extent map tree covering the range [140K, 144K), so it searches the
         subvolume tree for any file extent items covering that range.
      
         There it finds the file extent item for the range [108K, 144K),
         creates a compressed extent map for that range and then calls
         btrfs_add_extent_mapping() with that extent map and passes the
         range [140K, 144K) via the "start" and "len" parameters;
      
      6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
         fails with -EEXIST because there's an extent map, created at step 2
         for the [120K, 128K) range, that overlaps with the range of the
         given extent map ([108K, 144K)).

         Then it does a lookup for the extent map from step 2 and calls
         merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
         That adjusts the extent map to a start offset of 128K and a length
         of 16K (starting just after the extent map from step 2), but it does
         not update the offset field of the extent map, leaving it with a value
         of zero instead of updating it to a value of 20K (128K - 108K = 20K).
      
         As a result any read for the range [128K, 144K) can return
         incorrect data since we read from a wrong section of the extent (unless
         both the correct and incorrect ranges happen to have the same data).
      
      So fix this by changing merge_extent_mapping() to update the extent map's
      offset even if it's compressed. Also add a test case to the self tests.
      This didn't happen before the patchset that does big changes in the extent
      map structure (which includes the commit in the Fixes tag below) because
      we kept track of the original start offset in the extent map (member
      "orig_start") so we could always calculate the correct offset by
      subtracting that offset from the start offset.
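
      A minimal sketch of the adjustment described above, using hypothetical
      field names to show why the offset must move together with the start
      (not the actual patch):

         /* Trim the front of the new extent map so it starts right after
          * the existing one; the offset into the (compressed) extent must
          * advance by the same amount, e.g. 128K - 108K = 20K above. */
         u64 start_diff = new_start - em_start;   /* 128K - 108K */
         em_start  += start_diff;
         em_len    -= start_diff;
         em_offset += start_diff;   /* previously skipped for compressed extents */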
      
      A test case for fstests that triggered this problem using send/receive
      with compressed writes will be added soon.
      
      Fixes: 3d2ac992 ("btrfs: introduce new members for extent_map")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de9f46cb
    • btrfs: tree-checker: validate dref root and objectid · f333a3c7
      Qu Wenruo authored
      
      [CORRUPTION]
      There is a bug report that btrfs flips RO due to a corruption in the
      extent tree; the involved dump looks like this:
      
       	item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
       		extent refs 3 gen 3678544 flags 1
       		ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
       		ref#1: shared data backref parent 1947073626112 count 1
       		ref#2: shared data backref parent 1156030103552 count 1
       BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
       BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2
      
      [CAUSE]
      The corrupted entry is ref#0 of item 188.
      The root number 13835058055282163977 is beyond the upper limit for root
      items (the current limit is 1 << 48), and the objectid also looks
      suspicious.
      
      Only the offset and count are correct.
      
      [ENHANCEMENT]
      Although it's still unknown why so many bytes got corrupted randomly,
      we can still enhance the tree-checker for data backrefs by:

      - Validate the root value
        For now only 3 types of roots can have data backrefs:
        * subvolume trees
        * data reloc trees
        * root tree
          Only for v1 space cache

      - Validate the objectid value
        The objectid should be a valid inode number.
      
      Hopefully we can catch such problem in the future with the new checkers.
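
      For illustration, a rough sketch of such checks, using existing btrfs
      constants and helpers but with hypothetical placement (not the actual
      tree-checker hunk):

         /* A data backref root must be a subvolume tree, the data reloc
          * tree, or the root tree (v1 space cache only). */
         if (!is_fstree(root_id) &&
             root_id != BTRFS_DATA_RELOC_TREE_OBJECTID &&
             root_id != BTRFS_ROOT_TREE_OBJECTID)
                 return -EUCLEAN;

         /* The backref objectid must be a valid inode number. */
         if (objectid < BTRFS_FIRST_FREE_OBJECTID ||
             objectid > BTRFS_LAST_FREE_OBJECTID)
                 return -EUCLEAN;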
      
      Reported-by: Kai Krakow <hurikhan77@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f333a3c7
  3. Jul 24, 2024
    • sysctl: treewide: constify the ctl_table argument of proc_handlers · 78eb4ea2
      Joel Granados authored
      
      const qualify the struct ctl_table argument in the proc_handler function
      signatures. This is a prerequisite to moving the static ctl_table
      structs into .rodata, which will ensure that proc_handler function
      pointers cannot be modified.
      
      This patch has been generated by the following coccinelle script:
      
      ```
        virtual patch
      
        @r1@
        identifier ctl, write, buffer, lenp, ppos;
        identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
        @r2@
        identifier func, ctl, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int write, void *buffer, size_t *lenp, loff_t *ppos)
        { ... }
      
        @r3@
        identifier func;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int , void *, size_t *, loff_t *);
      
        @r4@
        identifier func, ctl;
        @@
      
        int func(
        - struct ctl_table *ctl
        + const struct ctl_table *ctl
          ,int , void *, size_t *, loff_t *);
      
        @r5@
        identifier func, write, buffer, lenp, ppos;
        @@
      
        int func(
        - struct ctl_table *
        + const struct ctl_table *
          ,int write, void *buffer, size_t *lenp, loff_t *ppos);
      
      ```
      
      * Code formatting was adjusted in xfs_sysctl.c to comply with code
        conventions. The xfs_stats_clear_proc_handler,
        xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were
        adjusted.
      
      * The ctl_table argument in proc_watchdog_common was const qualified.
        This is called from a proc_handler itself and is calling back into
        another proc_handler, making it necessary to change it as part of the
        proc_handler migration.
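
      The net effect on each handler is the const qualifier on its first
      parameter; a generic sketch with a hypothetical handler name:

         /* before */
         static int my_handler(struct ctl_table *ctl, int write,
                               void *buffer, size_t *lenp, loff_t *ppos);

         /* after */
         static int my_handler(const struct ctl_table *ctl, int write,
                               void *buffer, size_t *lenp, loff_t *ppos);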
      
      Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
      Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
      Co-developed-by: Joel Granados <j.granados@samsung.com>
      Signed-off-by: Joel Granados <j.granados@samsung.com>
      78eb4ea2
    • hostfs: fix folio conversion · e44be002
      Linus Torvalds authored
      Commit e3ec0fe9 ("hostfs: Convert hostfs_read_folio() to use a
      folio") simplified hostfs_read_folio(), but in the process of converting
      to using folios natively also mis-used the folio_zero_tail() function
      due to the very confusing API of that function.
      
      Very arguably it's folio_zero_tail() API itself that is buggy, since it
      would make more sense (and the documentation kind of implies) that the
      third argument would be the pointer to the beginning of the folio
      buffer.
      
      But no, the third argument to folio_zero_tail() is where we should start
      zeroing the tail (even if we already also pass in the offset separately
      as the second argument).
      
      So fix the hostfs caller, and we can leave any folio_zero_tail() sanity
      cleanup for later.
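
      A sketch of the corrected call as described above, assuming the folio
      data is mapped at kaddr (not necessarily the exact hunk):

         /* Zero from 'bytes_read' to the end of the folio: the third
          * argument must point at the position where zeroing starts, not
          * at the start of the folio buffer. */
         kaddr = folio_zero_tail(folio, bytes_read, kaddr + bytes_read);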
      
      Reported-and-tested-by: Maciej Żenczykowski <maze@google.com>
      Fixes: e3ec0fe9 ("hostfs: Convert hostfs_read_folio() to use a folio")
      Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e44be002
    • inode: clarify what's locked · f5e5e97c
      Christian Brauner authored
      
      In __wait_on_freeing_inode() we warn in case the inode_hash_lock is held
      but the inode is unhashed. We then release the inode_lock. So using
      "locked" as parameter name is confusing. Use is_inode_hash_locked as
      parameter name instead.
      
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      f5e5e97c
    • vfs: Fix potential circular locking through setxattr() and removexattr() · c3a5e3e8
      David Howells authored
      
      When using cachefiles, lockdep may emit something similar to the circular
      locking dependency notice below.  The problem appears to stem from the
      following:
      
       (1) Cachefiles manipulates xattrs on the files in its cache when called
           from ->writepages().
      
       (2) The setxattr() and removexattr() system call handlers get the name
           (and value) from userspace after taking the sb_writers lock, putting
           accesses of the vma->vm_lock and mm->mmap_lock inside of that.
      
       (3) The afs filesystem uses a per-inode lock to prevent multiple
           revalidation RPCs and in writeback vs truncate to prevent parallel
           operations from deadlocking against the server on one side and local
           page locks on the other.
      
      Fix this by moving the getting of the name and value in {set,remove}xattr()
      outside of the sb_writers lock.  This also has the minor benefits that we
      don't need to reget these in the event of a retry and we never try to take
      the sb_writers lock in the event we can't pull the name and value into the
      kernel.
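
      Conceptually, the reordering looks like this (a hedged sketch with
      hypothetical locals, not the actual VFS diff):

         /* Copy the attribute name from userspace first: this may fault
          * and take mm/vma locks, but no filesystem locks are held yet. */
         error = strncpy_from_user(kname, uname, sizeof(kname));
         if (error < 0)
                 return error;

         /* Only now take write access to the superblock (sb_writers). */
         error = mnt_want_write(path->mnt);
         if (error)
                 return error;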
      
      Alternative approaches that might fix this include moving the dispatch of a
      write to the cache off to a workqueue or trying to do without the
      validation lock in afs.  Note that this might also affect other filesystems
      that use netfslib and/or cachefiles.
      
       ======================================================
       WARNING: possible circular locking dependency detected
       6.10.0-build2+ #956 Not tainted
       ------------------------------------------------------
       fsstress/6050 is trying to acquire lock:
       ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0
      
       but task is already holding lock:
       ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #4 (&vma->vm_lock->lock){++++}-{3:3}:
              __lock_acquire+0xaf0/0xd80
              lock_acquire.part.0+0x103/0x280
              down_write+0x3b/0x50
              vma_start_write+0x6b/0xa0
              vma_link+0xcc/0x140
              insert_vm_struct+0xb7/0xf0
              alloc_bprm+0x2c1/0x390
              kernel_execve+0x65/0x1a0
              call_usermodehelper_exec_async+0x14d/0x190
              ret_from_fork+0x24/0x40
              ret_from_fork_asm+0x1a/0x30
      
       -> #3 (&mm->mmap_lock){++++}-{3:3}:
              __lock_acquire+0xaf0/0xd80
              lock_acquire.part.0+0x103/0x280
              __might_fault+0x7c/0xb0
              strncpy_from_user+0x25/0x160
              removexattr+0x7f/0x100
              __do_sys_fremovexattr+0x7e/0xb0
              do_syscall_64+0x9f/0x100
              entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
       -> #2 (sb_writers#14){.+.+}-{0:0}:
              __lock_acquire+0xaf0/0xd80
              lock_acquire.part.0+0x103/0x280
              percpu_down_read+0x3c/0x90
              vfs_iocb_iter_write+0xe9/0x1d0
              __cachefiles_write+0x367/0x430
              cachefiles_issue_write+0x299/0x2f0
              netfs_advance_write+0x117/0x140
              netfs_write_folio.isra.0+0x5ca/0x6e0
              netfs_writepages+0x230/0x2f0
              afs_writepages+0x4d/0x70
              do_writepages+0x1e8/0x3e0
              filemap_fdatawrite_wbc+0x84/0xa0
              __filemap_fdatawrite_range+0xa8/0xf0
              file_write_and_wait_range+0x59/0x90
              afs_release+0x10f/0x270
              __fput+0x25f/0x3d0
              __do_sys_close+0x43/0x70
              do_syscall_64+0x9f/0x100
              entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
       -> #1 (&vnode->validate_lock){++++}-{3:3}:
              __lock_acquire+0xaf0/0xd80
              lock_acquire.part.0+0x103/0x280
              down_read+0x95/0x200
              afs_writepages+0x37/0x70
              do_writepages+0x1e8/0x3e0
              filemap_fdatawrite_wbc+0x84/0xa0
              filemap_invalidate_inode+0x167/0x1e0
              netfs_unbuffered_write_iter+0x1bd/0x2d0
              vfs_write+0x22e/0x320
              ksys_write+0xbc/0x130
              do_syscall_64+0x9f/0x100
              entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
       -> #0 (mapping.invalidate_lock#3){++++}-{3:3}:
              check_noncircular+0x119/0x160
              check_prev_add+0x195/0x430
              __lock_acquire+0xaf0/0xd80
              lock_acquire.part.0+0x103/0x280
              down_read+0x95/0x200
              filemap_fault+0x26e/0x8b0
              __do_fault+0x57/0xd0
              do_pte_missing+0x23b/0x320
              __handle_mm_fault+0x2d4/0x320
              handle_mm_fault+0x14f/0x260
              do_user_addr_fault+0x2a2/0x500
              exc_page_fault+0x71/0x90
              asm_exc_page_fault+0x22/0x30
      
       other info that might help us debug this:
      
       Chain exists of:
         mapping.invalidate_lock#3 --> &mm->mmap_lock --> &vma->vm_lock->lock
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         rlock(&vma->vm_lock->lock);
                                      lock(&mm->mmap_lock);
                                      lock(&vma->vm_lock->lock);
         rlock(mapping.invalidate_lock#3);
      
        *** DEADLOCK ***
      
       1 lock held by fsstress/6050:
        #0: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250
      
       stack backtrace:
       CPU: 0 PID: 6050 Comm: fsstress Not tainted 6.10.0-build2+ #956
       Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x57/0x80
        check_noncircular+0x119/0x160
        ? queued_spin_lock_slowpath+0x4be/0x510
        ? __pfx_check_noncircular+0x10/0x10
        ? __pfx_queued_spin_lock_slowpath+0x10/0x10
        ? mark_lock+0x47/0x160
        ? init_chain_block+0x9c/0xc0
        ? add_chain_block+0x84/0xf0
        check_prev_add+0x195/0x430
        __lock_acquire+0xaf0/0xd80
        ? __pfx___lock_acquire+0x10/0x10
        ? __lock_release.isra.0+0x13b/0x230
        lock_acquire.part.0+0x103/0x280
        ? filemap_fault+0x26e/0x8b0
        ? __pfx_lock_acquire.part.0+0x10/0x10
        ? rcu_is_watching+0x34/0x60
        ? lock_acquire+0xd7/0x120
        down_read+0x95/0x200
        ? filemap_fault+0x26e/0x8b0
        ? __pfx_down_read+0x10/0x10
        ? __filemap_get_folio+0x25/0x1a0
        filemap_fault+0x26e/0x8b0
        ? __pfx_filemap_fault+0x10/0x10
        ? find_held_lock+0x7c/0x90
        ? __pfx___lock_release.isra.0+0x10/0x10
        ? __pte_offset_map+0x99/0x110
        __do_fault+0x57/0xd0
        do_pte_missing+0x23b/0x320
        __handle_mm_fault+0x2d4/0x320
        ? __pfx___handle_mm_fault+0x10/0x10
        handle_mm_fault+0x14f/0x260
        do_user_addr_fault+0x2a2/0x500
        exc_page_fault+0x71/0x90
        asm_exc_page_fault+0x22/0x30
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/2136178.1721725194@warthog.procyon.org.uk
      
      
      cc: Alexander Viro <viro@zeniv.linux.org.uk>
      cc: Christian Brauner <brauner@kernel.org>
      cc: Jan Kara <jack@suse.cz>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Gao Xiang <xiang@kernel.org>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: netfs@lists.linux.dev
      cc: linux-erofs@lists.ozlabs.org
      cc: linux-fsdevel@vger.kernel.org
      [brauner: fix minor issues]
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      c3a5e3e8
    • filelock: Fix fcntl/close race recovery compat path · f8138f2a
      Jann Horn authored
      When I wrote commit 3cad1bc0 ("filelock: Remove locks reliably when
      fcntl/close race is detected"), I missed that there are two copies of the
      code I was patching: The normal version, and the version for 64-bit offsets
      on 32-bit kernels.
      Thanks to Greg KH for stumbling over this while doing the stable
      backport...
      
      Apply exactly the same fix to the compat path for 32-bit kernels.
      
      Fixes: c293621b ("[PATCH] stale POSIX lock handling")
      Cc: stable@kernel.org
      Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
      
      
      Signed-off-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20240723-fs-lock-recover-compatfix-v1-1-148096719529@google.com
      
      
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      f8138f2a
    • fs: use all available ids · 8eac5358
      Christian Brauner authored
      The counter is unconditionally incremented for each mount allocation.
      If we set it to 1ULL << 32 we're losing 4294967296 as the first valid
      non-32 bit mount id.
      
      Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-1-834113cab0d2@kernel.org
      
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      8eac5358
    • cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT · 51d37982
      David Howells authored
      
      Set the maximum size of a subrequest that writes to cachefiles to be
      MAX_RW_COUNT so that we don't overrun the maximum write we can make to the
      backing filesystem.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/1599005.1721398742@warthog.procyon.org.uk
      
      
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      51d37982
    • netfs: Fix writeback that needs to go to both server and cache · 212be98a
      David Howells authored
      When netfslib is performing writeback (ie. ->writepages), it maintains two
      parallel streams of writes, one to the server and one to the cache, but it
      doesn't mark either stream of writes as active until it gets some data that
      needs to be written to that stream.
      
      This is done because some folios will only be written to the cache
      (e.g. copying to the cache on read is done by marking the folios and
      letting writeback do the actual work) and sometimes we'll only be writing
      to the server (e.g. if there's no cache).
      
      Now, since we don't actually dispatch uploads and cache writes in parallel,
      but rather flip between the streams, depending on which has the lowest
      so-far-issued offset, and don't wait for the subreqs to finish before
      flipping, we can end up in a situation where, say, we issue a write to the
      server and this completes before we start the write to the cache.
      
      But because we only activate a stream when we first add a subreq to it, the
      result collection code may run before we manage to activate the stream -
      resulting in the folio being cleaned and having the writeback-in-progress
      mark removed.  At this point, the folio no longer belongs to us.
      
      This is only really a problem for folios that need to be written to both
      streams - and in that case, the upload to the server is started first,
      followed by the write to the cache - and the cache write may see a bad
      folio.
      
      Fix this by activating the cache stream up front if there's a cache
      available.  If there's a cache, then all data is going to be written to it.
      
      Fixes: 288ace2f ("netfs: New writeback implementation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/1599053.1721398818@warthog.procyon.org.uk
      
      
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      212be98a
    • pidfs: handle kernels without namespaces cleanly · 9b3e1504
      Christian Brauner authored
      The nsproxy structure contains nearly all of the namespaces associated
      with a task. When a given namespace type is not supported by this kernel,
      the rules for whether the corresponding pointer in struct nsproxy is NULL
      or always init_<ns_type>_ns differ per namespace. Ideally, that wouldn't
      be the case and for all namespace types we'd always set it to
      init_<ns_type>_ns when the corresponding namespace type isn't supported.
      
      Make sure we handle all namespaces where the pointer in struct nsproxy
      can be NULL when the namespace type isn't supported.
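
      For example, a minimal sketch of such a guard (hypothetical, not the
      actual pidfs hunk):

         /* A NULL nsproxy pointer means this namespace type is compiled
          * out of the kernel, so report the request as unsupported. */
         if (!ns_common)
                 return -EOPNOTSUPP;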
      
      Link: https://lore.kernel.org/r/20240722-work-pidfs-e6a83030f63e@brauner
      Fixes: 5b08bd40 ("pidfs: allow retrieval of namespace file descriptors") # mainline only
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      9b3e1504
    • pidfs: when time ns disabled add check for ioctl · f60d38cb
      Edward Adam Davis authored
      syzbot called pidfd_ioctl() with cmd "PIDFD_GET_TIME_NAMESPACE" while
      CONFIG_TIME_NS was disabled; since time_ns is NULL, this causes a NULL
      pointer dereference in open_namespace().
      
      Fixes: 5b08bd40 ("pidfs: allow retrieval of namespace file descriptors") # mainline only
      Reported-and-tested-by: <syzbot+34a0ee986f61f15da35d@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=34a0ee986f61f15da35d
      Signed-off-by: Edward Adam Davis <eadavis@qq.com>
      Link: https://lore.kernel.org/r/tencent_7FAE8DB725EE0DD69236DDABDDDE195E4F07@qq.com
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      f60d38cb
    • vfs: correct the comments of vfs_*() helpers · b40c8e7a
      Congjie Zhou authored
      Correct the comments of vfs_*() helpers in fs/namei.c, including:
      1. vfs_create()
      2. vfs_mknod()
      3. vfs_mkdir()
      4. vfs_rmdir()
      5. vfs_symlink()
      
      All of them come from the same commit:
      6521f891 ("namei: prepare for idmapped mounts")
      
      The @dentry is actually the dentry of the child directory rather than the
      base (parent) directory, and thus the description of @dir has to be
      adjusted to match the change of @dentry.
      
      Signed-off-by: Congjie Zhou <zcjie0802@qq.com>
      Link: https://lore.kernel.org/r/tencent_2FCF6CC9E10DC8A27AE58A5A0FE4FCE96D0A@qq.com
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      b40c8e7a
    • vfs: handle __wait_on_freeing_inode() and evict() race · 5bc9ad78
      Mateusz Guzik authored
      Lockless hash lookup can find and lock the inode after it gets the
      I_FREEING flag set, at which point it blocks waiting for teardown in
      evict() to finish.
      
      However, the flag is still set even after evict() wakes up all waiters.
      
      This results in a race where if the inode lock is taken late enough, it
      can happen after both hash removal and wakeups, meaning there is nobody
      to wake the racing thread up.
      
      This worked prior to RCU-based lookup because the entire ordeal was
      synchronized with the inode hash lock.
      
      Since unhashing requires the inode lock, we can safely check whether it
      happened after acquiring it.
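
      A rough sketch of the idea, with hypothetical placement in the lookup
      slow path (not the exact patch):

         spin_lock(&inode->i_lock);
         if (inode_unhashed(inode)) {
                 /* evict() already removed the inode from the hash and did
                  * its wakeups; nobody is left to wake us, so don't sleep
                  * on I_FREEING - drop the lock and retry the lookup. */
                 spin_unlock(&inode->i_lock);
                 return;   /* caller retries the lookup */
         }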
      
      Link: https://lore.kernel.org/v9fs/20240717102458.649b60be@kernel.org/
      
      
      Reported-by: Dominique Martinet <asmadeus@codewreck.org>
      Fixes: 7180f8d9 ("vfs: add rcu-based find_inode variants for iget ops")
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Link: https://lore.kernel.org/r/20240718151838.611807-1-mjguzik@gmail.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      5bc9ad78
    • netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG · fcad9336
      David Howells authored
      
      CONFIG_FSCACHE_DEBUG should have been renamed to CONFIG_NETFS_DEBUG, so do
      that now.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/1410796.1721333406@warthog.procyon.org.uk
      
      
      cc: Uwe Kleine-König <ukleinek@kernel.org>
      cc: Christian Brauner <brauner@kernel.org>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      fcad9336