  1. Nov 03, 2023
  2. Oct 18, 2023
  3. Aug 24, 2023
    • ceph: drop messages from MDS when unmounting · e3dfcab2
      Xiubo Li authored
      When unmounting, all the dirty buffers will be flushed, and after
      the last OSD request finishes the last i_count reference will be
      released. The client then flushes the dirty caps/snaps to the MDSs,
      but the unmount won't wait for the possible acks. Those acks would
      ihold the inodes while updating the metadata locally, which no
      longer makes sense at that point, and the held references cause
      evict_inodes() to skip these inodes.
      
      If encryption is enabled, the kernel generates a warning when
      removing the encryption keys while the skipped inodes still hold
      the keyring:
      
      WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
      CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
      Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
      RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
      RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
      RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
      RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
      RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
      R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
      R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
      FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      generic_shutdown_super+0x47/0x120
      kill_anon_super+0x14/0x30
      ceph_kill_sb+0x36/0x90 [ceph]
      deactivate_locked_super+0x29/0x60
      cleanup_mnt+0xb8/0x140
      task_work_run+0x67/0xb0
      exit_to_user_mode_prepare+0x23d/0x240
      syscall_exit_to_user_mode+0x25/0x60
      do_syscall_64+0x40/0x80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7fd83dc39e9b
      
      Later the kernel will crash when iput() the inodes and dereferencing
      the "sb->s_master_keys", which has been released by the
      generic_shutdown_super().
      
      Link: https://tracker.ceph.com/issues/59162
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
      Reviewed-by: Milind Changire <mchangir@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  4. Jul 13, 2023
  5. Jun 30, 2023
  6. Jun 08, 2023
  7. May 18, 2023
  8. Feb 02, 2023
  9. Nov 14, 2022
  10. Jun 09, 2022
    • netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context · 874c8ca1
      David Howells authored
      While randstruct was satisfied with using an open-coded "void *" offset
      cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
      used by FORTIFY_SOURCE was not as easily fooled.  This was causing the
      following complaint[1] from gcc v12:
      
        In file included from include/linux/string.h:253,
                         from include/linux/ceph/ceph_debug.h:7,
                         from fs/ceph/inode.c:2:
        In function 'fortify_memset_chk',
            inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
            inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
        include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
          242 |                         __write_overflow_field(p_size_field, size);
              |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix this by embedding a struct inode into struct netfs_i_context (which
      should perhaps be renamed to struct netfs_inode).  The struct inode
      vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
      structs and vfs_inode is then simply changed to "netfs.inode" in those
      filesystems.
      
      Further, rename netfs_i_context to netfs_inode, get rid of the
      netfs_inode() function that converted a netfs_i_context pointer to an
      inode pointer (that can now be done with &ctx->inode) and rename the
      netfs_i_context() function to netfs_inode() (which is now a wrapper
      around container_of()).
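
      The netfs_inode() conversion described above is the standard
      container_of() pattern: given a pointer to the embedded struct
      inode, subtract its offset to recover the containing structure.
      A minimal userspace sketch (struct layouts simplified; only the
      names come from the commit):

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the structs described in the commit. */
struct inode {
	unsigned long i_ino;
};

struct netfs_inode {
	struct inode inode;	/* VFS inode embedded directly */
	unsigned long flags;
};

/* netfs_inode(): now just a wrapper around container_of() --
 * recover the netfs_inode from a pointer to its embedded inode. */
static inline struct netfs_inode *netfs_inode(struct inode *i)
{
	return container_of(i, struct netfs_inode, inode);
}
```

      Because the inode is embedded rather than reached through an
      offset cast, __builtin_object_size() sees the real field sizes and
      FORTIFY_SOURCE has nothing to complain about, and struct
      randomisation reordering the members cannot break the conversion.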
      
      Most of the changes were done with:
      
        perl -p -i -e 's/vfs_inode/netfs.inode/'g \
              `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`
      
      Kees suggested doing it with a pair structure[2] and a special
      declarator to insert that into the network filesystem's inode
      wrapper[3], but I think it's cleaner to embed it - and then it doesn't
      matter if struct randomisation reorders things.
      
      Dave Chinner suggested using a filesystem-specific VFS_I() function in
      each filesystem to convert that filesystem's own inode wrapper struct
      into the VFS inode struct[4].
      
      Version #2:
       - Fix a couple of missed name changes due to a disabled cifs option.
       - Rename netfs_i_context to netfs_inode
       - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
         structs.
      
      [ This also undoes commit 507160f4 ("netfs: gcc-12: temporarily
        disable '-Wattribute-warning' for now") that is no longer needed ]
      
      Fixes: bc899ee1 ("netfs: Add a netfs inode context")
      Reported-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Xiubo Li <xiubli@redhat.com>
      cc: Jonathan Corbet <corbet@lwn.net>
      cc: Eric Van Hensbergen <ericvh@gmail.com>
      cc: Latchesar Ionkov <lucho@ionkov.net>
      cc: Dominique Martinet <asmadeus@codewreck.org>
      cc: Christian Schoenebeck <linux_oss@crudebyte.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: Ilya Dryomov <idryomov@gmail.com>
      cc: Steve French <smfrench@gmail.com>
      cc: William Kucharski <william.kucharski@oracle.com>
      cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      cc: Dave Chinner <david@fromorbit.com>
      cc: linux-doc@vger.kernel.org
      cc: v9fs-developer@lists.sourceforge.net
      cc: linux-afs@lists.infradead.org
      cc: ceph-devel@vger.kernel.org
      cc: linux-cifs@vger.kernel.org
      cc: samba-technical@lists.samba.org
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-hardening@vger.kernel.org
      Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
      Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
      Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
      Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
      Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
      Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. Mar 01, 2022
  12. Sep 02, 2021
  13. Aug 25, 2021
    • ceph: correctly handle releasing an embedded cap flush · b2f9fa1f
      Xiubo Li authored
      The ceph_cap_flush structures are usually dynamically allocated, but
      the ceph_cap_snap has an embedded one.
      
      When force umounting, the client will try to remove all the session
      caps. During this, it will free them, but that should not be done
      with the ones embedded in a capsnap.
      
      Fix this by adding a new boolean that indicates that the cap flush is
      embedded in a capsnap, and skip freeing it if that's set.
      
      At the same time, switch to using list_del_init() when detaching the
      i_list and g_list heads.  It's possible for a forced umount to remove
      these objects but then handle_cap_flushsnap_ack() races in and does the
      list_del_init() again, corrupting memory.
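
      The list_del_init() point generalizes: a plain delete leaves the
      detached entry's link pointers stale, so a racing second removal
      corrupts the list, while list_del_init() re-points the entry at
      itself so deleting it again is a harmless no-op. A minimal
      userspace sketch of the two primitives (simplified from the
      kernel's include/linux/list.h):

```c
struct list_head {
	struct list_head *next, *prev;
};

static inline void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static inline void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

/* list_del_init(): detach the entry, then reinitialize it as an empty
 * list, so a racing second list_del_init() (e.g. forced umount vs. a
 * late flushsnap ack) just unlinks it from itself and corrupts nothing. */
static inline void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}
```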
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/52283
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  14. Aug 04, 2021
    • ceph: take snap_empty_lock atomically with snaprealm refcount change · 8434ffe7
      Jeff Layton authored
      There is a race in ceph_put_snap_realm. The change to the nref and the
      spinlock acquisition are not done atomically, so you could decrement
      nref, and before you take the spinlock, the nref is incremented again.
      At that point, you end up putting it on the empty list when it
      shouldn't be there. Eventually __cleanup_empty_realms runs and frees
      it when it's still in-use.
      
      Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
      and just drop the spinlock if we can get the rwsem.
      
      Because these objects can also undergo a 0->1 refcount transition, we
      must protect that change as well with the spinlock. Increment locklessly
      unless the value is at 0, in which case we take the spinlock, increment
      and then take it off the empty list if it did the 0->1 transition.
      
      With these changes, I'm removing the dout() messages from these
      functions, as well as in __put_snap_realm. They've always been racy, and
      it's better to not print values that may be misleading.
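
      The atomic_dec_and_lock() idiom closes exactly the window described
      above: the final decrement and the lock acquisition happen as one
      observable step, so no other thread can re-increment between them
      unnoticed. A userspace sketch of the idea using C11 atomics and a
      pthread mutex standing in for the spinlock (the real helper lives
      in the kernel's lib/dec_and_lock.c):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdbool.h>

/* Returns true *with the lock held* iff this call performed the 1->0
 * transition; decrements locklessly otherwise. */
static bool dec_and_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
	int old = atomic_load(cnt);

	/* Fast path: while the count is > 1 we can decrement locklessly. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
		/* CAS failure reloads 'old'; loop and re-check. */
	}

	/* Slow path: take the lock, then do the final decrement under it,
	 * so the 1->0 transition and the lock are a single atomic step. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;
	pthread_mutex_unlock(lock);
	return false;
}
```

      With this shape, the caller only touches the empty list while it
      knows the count really reached zero under the lock, which is why
      the 0->1 side must take the same spinlock before resurrecting an
      object whose count is at 0.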
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/46419
      Reported-by: Mark Nelson <mnelson@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Luis Henriques <lhenriques@suse.de>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  15. Jun 28, 2021
  16. Apr 27, 2021
  17. Feb 16, 2021
  18. Nov 04, 2020
    • ceph: check session state after bumping session->s_seq · 62575e27
      Jeff Layton authored
      Some messages sent by the MDS entail a session sequence number
      increment, and the MDS will drop certain types of requests on the floor
      when the sequence numbers don't match.
      
      In particular, a REQUEST_CLOSE message can cross with one of the
      sequence morphing messages from the MDS which can cause the client to
      stall, waiting for a response that will never come.
      
      Originally, this meant up to a 5s delay before the recurring
      workqueue job kicked in and resent the request, but a recent change
      made it so that the client would never resend, causing a 60s stall
      during unmount and sometimes a blocklisting event.
      
      Add a new helper for incrementing the session sequence and then testing
      to see whether a REQUEST_CLOSE needs to be resent, and move the handling
      of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
      bare sequence counter increments to use the new helper.
      
      Reorganize check_session_state with a switch statement.  It should no
      longer be called when the session is CLOSING, so throw a warning if it
      ever is (but still handle that case sanely).
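
      The shape of such a helper might look like the following
      hypothetical sketch (struct fields, state names, and the resend
      bookkeeping are illustrative stand-ins, not the kernel's actual
      definitions):

```c
/* Illustrative session states; the real set lives in fs/ceph. */
enum session_state { SESSION_OPEN, SESSION_CLOSING };

struct mds_session {
	enum session_state s_state;
	unsigned long long s_seq;
	int close_resends;	/* illustrative: REQUEST_CLOSE resend count */
};

/* Bump the session sequence, then re-check whether our REQUEST_CLOSE
 * crossed with the message that bumped it and must be resent with the
 * updated sequence number. */
static void inc_session_sequence(struct mds_session *s)
{
	s->s_seq++;

	if (s->s_state == SESSION_CLOSING) {
		/* The MDS would drop the stale REQUEST_CLOSE on the
		 * floor; resend it so the unmount doesn't stall. */
		s->close_resends++;
	}
}
```

      Routing every bare s_seq increment through one helper is the real
      point: the crossed-message check can then never be forgotten at a
      call site.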
      
      [ idryomov: whitespace, pr_err() call fixup ]
      
      URL: https://tracker.ceph.com/issues/47563
      Fixes: fa996773 ("ceph: fix potential mdsc use-after-free crash")
      Reported-by: Patrick Donnelly <pdonnell@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Xiubo Li <xiubli@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  19. Oct 12, 2020
  20. Mar 23, 2020
    • ceph: fix memory leak in ceph_cleanup_snapid_map() · c8d6ee01
      Luis Henriques authored
      kmemleak reports the following memory leak:
      
      unreferenced object 0xffff88821feac8a0 (size 96):
        comm "kworker/1:0", pid 17, jiffies 4294896362 (age 20.512s)
        hex dump (first 32 bytes):
          a0 c8 ea 1f 82 88 ff ff 00 c9 ea 1f 82 88 ff ff  ................
          00 00 00 00 00 00 00 00 00 01 00 00 00 00 ad de  ................
        backtrace:
          [<00000000b3ea77fb>] ceph_get_snapid_map+0x75/0x2a0
          [<00000000d4060942>] fill_inode+0xb26/0x1010
          [<0000000049da6206>] ceph_readdir_prepopulate+0x389/0xc40
          [<00000000e2fe2549>] dispatch+0x11ab/0x1521
          [<000000007700b894>] ceph_con_workfn+0xf3d/0x3240
          [<0000000039138a41>] process_one_work+0x24d/0x590
          [<00000000eb751f34>] worker_thread+0x4a/0x3d0
          [<000000007e8f0d42>] kthread+0xfb/0x130
          [<00000000d49bd1fa>] ret_from_fork+0x3a/0x50
      
      A kfree() is missing in the loop that walks the 'to_free' list of
      ceph_snapid_map objects, so the detached entries are leaked.
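
      The bug class is easy to reproduce: entries are moved onto a
      private to_free list under a lock, but the drain loop that walks
      the list afterwards forgets to free each detached entry. A
      userspace sketch of the corrected drain (singly linked list and
      names are illustrative, not the kernel's):

```c
#include <stdlib.h>

/* Illustrative stand-in for ceph_snapid_map. */
struct snapid_map {
	struct snapid_map *next;
	unsigned long snap;
};

/* Walk the detached to_free list and release every entry.
 * Returns the number of entries freed. */
static int drain_to_free(struct snapid_map *head)
{
	int freed = 0;

	while (head) {
		struct snapid_map *sm = head;

		head = sm->next;
		free(sm);	/* the kfree() the original loop was missing */
		freed++;
	}
	return freed;
}
```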
      
      Cc: stable@vger.kernel.org
      Fixes: 75c9627e ("ceph: map snapid to anonymous bdev ID")
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  21. Aug 22, 2019
    • ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob() · 12fe3dda
      Luis Henriques authored
      
      Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
      freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  Fix
      this by having the function return the old blob buffer and having its
      callers free it once the lock has been released.
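
      The pattern is the classic "swap under the lock, free outside it":
      since freeing may sleep, the function only detaches the old buffer
      under the lock and hands it back for the caller to release later.
      A userspace sketch with a pthread mutex standing in for
      i_ceph_lock (struct names are illustrative):

```c
#include <pthread.h>
#include <stddef.h>

struct buffer {
	char *data;
};

struct xattrs {
	pthread_mutex_t lock;	/* stand-in for i_ceph_lock */
	struct buffer *blob;
};

/* Install the new blob under the lock and return the old one instead
 * of freeing it here: freeing can sleep, so the caller must do it only
 * after the lock has been dropped. */
static struct buffer *swap_xattrs_blob(struct xattrs *x,
				       struct buffer *newblob)
{
	struct buffer *old;

	pthread_mutex_lock(&x->lock);
	old = x->blob;
	x->blob = newblob;
	pthread_mutex_unlock(&x->lock);

	return old;	/* caller frees this with no lock held */
}
```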
      
      The following backtrace was triggered by fstests generic/117.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
        4 locks held by fsstress/649:
         #0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
         #1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
         #2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
         #3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
        CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         __ceph_build_xattrs_blob+0x12b/0x170
         __send_cap+0x302/0x540
         ? __lock_acquire+0x23c/0x1e40
         ? __mark_caps_flushing+0x15c/0x280
         ? _raw_spin_unlock+0x24/0x30
         ceph_check_caps+0x5f0/0xc60
         ceph_flush_dirty_caps+0x7c/0x150
         ? __ia32_sys_fdatasync+0x20/0x20
         ceph_sync_fs+0x5a/0x130
         iterate_supers+0x8f/0xf0
         ksys_sync+0x4f/0xb0
         __ia32_sys_sync+0xa/0x10
         do_syscall_64+0x50/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fc6409ab617
      
      Signed-off-by: Luis Henriques <lhenriques@suse.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  22. Jul 08, 2019
  23. Jun 05, 2019
    • ceph: avoid iput_final() while holding mutex or in dispatch thread · 3e1d0452
      Yan, Zheng authored
      
      iput_final() may wait for readahead pages. The wait can cause
      deadlock. For example:
      
        Workqueue: ceph-msgr ceph_con_workfn [libceph]
          Call Trace:
           schedule+0x36/0x80
           io_schedule+0x16/0x40
           __lock_page+0x101/0x140
           truncate_inode_pages_range+0x556/0x9f0
           truncate_inode_pages_final+0x4d/0x60
           evict+0x182/0x1a0
           iput+0x1d2/0x220
           iterate_session_caps+0x82/0x230 [ceph]
           dispatch+0x678/0xa80 [ceph]
           ceph_con_workfn+0x95b/0x1560 [libceph]
           process_one_work+0x14d/0x410
           worker_thread+0x4b/0x460
           kthread+0x105/0x140
           ret_from_fork+0x22/0x40
      
        Workqueue: ceph-msgr ceph_con_workfn [libceph]
          Call Trace:
           __schedule+0x3d6/0x8b0
           schedule+0x36/0x80
           schedule_preempt_disabled+0xe/0x10
           mutex_lock+0x2f/0x40
           ceph_check_caps+0x505/0xa80 [ceph]
           ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph]
           writepages_finish+0x2d3/0x410 [ceph]
           __complete_request+0x26/0x60 [libceph]
           handle_reply+0x6c8/0xa10 [libceph]
           dispatch+0x29a/0xbb0 [libceph]
           ceph_con_workfn+0x95b/0x1560 [libceph]
           process_one_work+0x14d/0x410
           worker_thread+0x4b/0x460
           kthread+0x105/0x140
           ret_from_fork+0x22/0x40
      
      In the above example, truncate_inode_pages_range() waits for readahead
      pages while holding s_mutex. ceph_check_caps() waits for s_mutex and
      blocks the OSD dispatch thread, so later OSD replies (for the
      readahead) can't be handled.
      
      ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock
      can happen if iput_final() is called while holding snap_rwsem.
      
      In general, it's not good to call iput_final() inside MDS/OSD dispatch
      threads or while holding any mutex.
      
      The fix is introducing ceph_async_iput(), which calls iput_final() in
      workqueue.
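
      The mechanism can be sketched in userspace: if the reference being
      dropped is the last one, queue the inode onto deferred work instead
      of evicting it inline in the dispatch thread; a worker running in
      process context (where blocking is safe) performs the final put
      later. All names and the list-based "workqueue" below are
      illustrative stand-ins:

```c
#include <stddef.h>

struct inode {
	int i_count;
	int evicted;		/* set when the final put has run */
	struct inode *wq_next;	/* link on the deferred-put list */
};

static struct inode *deferred;	/* stand-in for the workqueue */

/* async_iput(): drop a reference; if it was the last one, defer the
 * final put rather than running eviction in the caller's context. */
static void async_iput(struct inode *in)
{
	if (--in->i_count > 0)
		return;
	in->wq_next = deferred;
	deferred = in;
}

/* run_workqueue(): worker in process context -- blocking (e.g. waiting
 * on readahead pages during truncation) is safe here. Returns the
 * number of inodes evicted. */
static int run_workqueue(void)
{
	int n = 0;

	while (deferred) {
		struct inode *in = deferred;

		deferred = in->wq_next;
		in->evicted = 1;	/* iput_final()/evict() would run here */
		n++;
	}
	return n;
}
```

      Deferring only the final put keeps the common case (non-final
      reference drops) as cheap as a plain decrement while moving the
      potentially blocking eviction out of the dispatch threads.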
      
      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  24. Apr 23, 2019
  25. Mar 05, 2019
  26. Feb 18, 2019