  1. Jul 06, 2024
  2. Jul 03, 2024
    • mm: remove MIGRATE_SYNC_NO_COPY mode · 90663284
      Kefeng Wang authored
      Commit 2916ecc0 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
      introduced a new MIGRATE_SYNC_NO_COPY mode to allow offloading the copy
      to a device DMA engine. The mode is only used in
      __migrate_device_pages() to decide whether or not to copy the old page,
      and it was only set in hmm. Since the code that set
      MIGRATE_SYNC_NO_COPY was removed by the previous cleanup, it seems that
      we can remove the now-unnecessary MIGRATE_SYNC_NO_COPY.
      
      Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev...
  3. Jun 20, 2024
    • fs: Initial atomic write support · c34fc6f2
      Prasad Singamsetty authored
      
      An atomic write is a write issued with torn-write protection, meaning
      that for a power failure or any other hardware failure, all or none of the
      data from the write will be stored, but never a mix of old and new data.
      
      Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
      write is to be issued with torn-write prevention, according to special
      alignment and length rules.
      
      For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
      iocb->ki_flags field to indicate the same.
      
      A call to statx will give the relevant atomic write info for a file:
      - atomic_write_unit_min
      - atomic_write_unit_max
      - atomic_write_segments_max
      
      Both min and max values must be a power-of-2.
      
      Applications can avail of the atomic write feature by ensuring that the total
      length of a write is a power-of-2 in size and also sized between
      atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
      must ensure that the write is at a naturally-aligned offset in the file
      wrt the total write length. The value in atomic_write_segments_max
      indicates the upper limit for IOV_ITER iovcnt.
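
The length and alignment rules above can be sketched as a small userspace check. This is only an illustration under stated assumptions: `atomic_write_ok()` and `is_pow2()` are hypothetical helper names, not kernel or libc APIs, and `unit_min`/`unit_max` stand for the atomic_write_unit_min and atomic_write_unit_max values that statx reports for the file.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper (not a kernel or libc API): checks the rules quoted
 * in the commit message for a write of `len` bytes at file offset `off`.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	if (!is_pow2(len))
		return false;		/* total length must be a power of 2 */
	if (len < unit_min || len > unit_max)
		return false;		/* and lie in [unit_min, unit_max] */
	return (off & (len - 1)) == 0;	/* offset naturally aligned to len */
}
```

An application might run such a check before adding RWF_ATOMIC to a pwritev2() call, so that a non-compliant write is caught early instead of being rejected by the kernel.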
      
      Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
      flag set will have RWF_ATOMIC rejected and not just ignored.
      
      Add a type argument to kiocb_set_rw_flags() to allow reads which have
      RWF_ATOMIC set to be rejected.
      
      Helper function generic_atomic_write_valid() can be used by FSes to verify
      compliant writes. There we check that the iov_iter type is ubuf, which
      implies iovcnt==1 for pwritev2(), which is an initial restriction for
      atomic_write_segments_max. Initially the only user will be the bdev file
      operations write handler. We will rely on the block BIO submission path to
      ensure write sizes are compliant for the bdev, so we don't need to check
      atomic write sizes yet.
      
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      jpg: merge into single patch and much rewrite
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Apr 15, 2024
  5. Apr 05, 2024
  6. Mar 05, 2024
  7. Feb 27, 2024
  8. Feb 21, 2024
    • fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio · b820de74
      Bart Van Assche authored
      
      If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
      following kernel warning appears:
      
      WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
      Call trace:
       kiocb_set_cancel_fn+0x9c/0xa8
       ffs_epfile_read_iter+0x144/0x1d0
       io_read+0x19c/0x498
       io_issue_sqe+0x118/0x27c
       io_submit_sqes+0x25c/0x5fc
       __arm64_sys_io_uring_enter+0x104/0xab0
       invoke_syscall+0x58/0x11c
       el0_svc_common+0xb4/0xf4
       do_el0_svc+0x2c/0xb0
       el0_svc+0x2c/0xa4
       el0t_64_sync_handler+0x68/0xb4
       el0t_64_sync+0x1a4/0x1a8
      
      Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
      submitted by libaio.
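
A minimal userspace model of the idea, assuming the flag is then tested at the top of kiocb_set_cancel_fn() (the message above only states that the flag is set for libaio I/O; the early return is an assumption drawn from the commit title). All `sketch_` names and the flag value are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Illustrative bit only; the real IOCB_AIO_RW value is defined by the kernel. */
#define SKETCH_IOCB_AIO_RW (1u << 0)

/* Toy stand-in for struct kiocb, reduced to what the sketch needs. */
struct sketch_kiocb {
	unsigned int ki_flags;
	void (*cancel_fn)(struct sketch_kiocb *);
};

/*
 * Register a cancel callback only for requests marked as libaio
 * reads/writes. Requests from other submitters (for example io_uring)
 * lack the flag and are left untouched.
 */
static void sketch_set_cancel_fn(struct sketch_kiocb *req,
				 void (*fn)(struct sketch_kiocb *))
{
	if (!(req->ki_flags & SKETCH_IOCB_AIO_RW))
		return;
	req->cancel_fn = fn;
}

/* Placeholder cancel callback used by callers of the sketch. */
static void sketch_cancel(struct sketch_kiocb *req)
{
	(void)req;
}
```

Gating on a per-request flag keeps the decision local to the request itself, so a driver such as the one in the call trace can call the helper unconditionally without knowing which submission path the request came from.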
      
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Sandeep Dhavale <dhavale@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: stable@vger.kernel.org
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  9. Dec 28, 2023
  10. Dec 05, 2023
  11. Nov 28, 2023
  12. Nov 21, 2023
  13. Sep 20, 2023
  14. Aug 21, 2023
  15. Jul 11, 2023
  16. Jun 15, 2023
  17. Feb 09, 2023
  18. Feb 03, 2023
  19. Nov 25, 2022
    • use less confusing names for iov_iter direction initializers · de4eda9d
      Al Viro authored
      
      READ/WRITE proved to be actively confusing - the meanings are
      "data destination, as used with read(2)" and "data source, as
      used with write(2)", but people keep interpreting those as
      "we read data from it" and "we write data to it", i.e. exactly
      the wrong way.
      
      Call them ITER_DEST and ITER_SOURCE - at least that is harder
      to misinterpret...
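
A toy model of the two directions may make the naming concrete. This is not the kernel's iov_iter: the `sketch_` types and functions are hypothetical, and only the direction semantics described above are modelled. Data is copied into a destination iterator (as with read(2)) and out of a source iterator (as with write(2)).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for ITER_DEST and ITER_SOURCE. */
enum sketch_dir { SKETCH_ITER_DEST, SKETCH_ITER_SOURCE };

struct sketch_iter {
	enum sketch_dir dir;
	char *buf;
	size_t len;
	size_t pos;
};

/* Copy data INTO the iterator: only valid for a destination, as in read(2). */
static size_t sketch_copy_to_iter(const void *src, size_t n,
				  struct sketch_iter *it)
{
	size_t room = it->len - it->pos;

	if (it->dir != SKETCH_ITER_DEST)
		return 0;
	if (n > room)
		n = room;
	memcpy(it->buf + it->pos, src, n);
	it->pos += n;
	return n;
}

/* Copy data OUT OF the iterator: only valid for a source, as in write(2). */
static size_t sketch_copy_from_iter(void *dst, size_t n,
				    struct sketch_iter *it)
{
	size_t avail = it->len - it->pos;

	if (it->dir != SKETCH_ITER_SOURCE)
		return 0;
	if (n > avail)
		n = avail;
	memcpy(dst, it->buf + it->pos, n);
	it->pos += n;
	return n;
}
```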
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. Sep 12, 2022
  21. Aug 02, 2022
  22. Jun 10, 2022
  23. Mar 16, 2022
  24. Mar 15, 2022
  25. Mar 08, 2022
  26. Jan 22, 2022
    • aio: move aio sysctl to aio.c · 86b12b6c
      Xiaoming Ni authored
      The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
      dishes; this makes it very difficult to maintain.
      
      To help with this maintenance let's start by moving sysctls to places
      where they actually belong.  The proc sysctl maintainers do not want to
      know what sysctl knobs you wish to add for your own piece of code, we
      just care about the core logic.
      
      Move aio sysctl to aio.c and use the new register_sysctl_init() to
      register the sysctl interface for aio.
      
      [mcgrof@kernel.org: adjust commit log to justify the move]
      
      Link: https://lkml.kernel.org/r/20211123202347.818157-9-mcgrof@kernel.org
      Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Qing Wang <wangqing@vivo.com>
      Cc: Sebastian Reichel <sre@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Stephen Kitt <steve@sk2.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Antti Palosaari <crope@iki.fi>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
      Cc: Lukas Middendorf <kernel@tuxforce.de>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Phillip Potter <phil@philpotter.co.uk>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Douglas Gilbert <dgilbert@interlog.com>
      Cc: James E.J. Bottomley <jejb@linux.ibm.com>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. Dec 09, 2021
    • aio: Fix incorrect usage of eventfd_signal_allowed() · 4b374986
      Xie Yongji authored
      We should defer eventfd_signal() to the workqueue when
      eventfd_signal_allowed() returns false, rather than when it
      returns true.
      
      Fixes: b542e383 ("eventfd: Make signal recursion protection a task bit")
      Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
      Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
    • aio: fix use-after-free due to missing POLLFREE handling · 50252e4b
      Eric Biggers authored
      signalfd_poll() and binder_poll() are special in that they use a
      waitqueue whose lifetime is the current task, rather than the struct
      file as is normally the case.  This is okay for blocking polls, since a
      blocking poll occurs within one task; however, non-blocking polls
      require another solution.  This solution is for the queue to be cleared
      before it is freed, by sending a POLLFREE notification to all waiters.
      
      Unfortunately, only eventpoll handles POLLFREE.  A second type of
      non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
      handle POLLFREE.  This allows a use-after-free to occur if a signalfd or
      binder fd is polled with aio poll, and the waitqueue gets freed.
      
      Fix this by making aio poll handle POLLFREE.
      
      A patch by Ramji Jiyani <ramjiyani@google.com>
      (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
      tried to do this by making aio_poll_wake() always complete the request
      inline if POLLFREE is seen.  However, that solution had two bugs.
      First, it introduced a deadlock, as it unconditionally locked the aio
      context while holding the waitqueue lock, which inverts the normal
      locking order.  Second, it didn't consider that POLLFREE notifications
      are missed while the request has been temporarily de-queued.
      
      The second problem was solved by my previous patch.  This patch then
      properly fixes the use-after-free by handling POLLFREE in a
      deadlock-free way.  It does this by taking advantage of the fact that
      freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.
      
      Fixes: 2c14fa83 ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.18+
      Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
    • aio: keep poll requests on waitqueue until completed · 363bee27
      Eric Biggers authored
      Currently, aio_poll_wake() will always remove the poll request from the
      waitqueue.  Then, if aio_poll_complete_work() sees that none of the
      polled events are ready and the request isn't cancelled, it re-adds the
      request to the waitqueue.  (This can easily happen when polling a file
      that doesn't pass an event mask when waking up its waitqueue.)
      
      This is fundamentally broken for two reasons:
      
        1. If a wakeup occurs between vfs_poll() and the request being
           re-added to the waitqueue, it will be missed because the request
           wasn't on the waitqueue at the time.  Therefore, IOCB_CMD_POLL
           might never complete even if the polled file is ready.
      
        2. When the request isn't on the waitqueue, there is no way to be
           notified that the waitqueue is being freed (which happens when its
           lifetime is shorter than the struct file's).  This is supposed to
           happen via the waitqueue entries being woken up with POLLFREE.
      
      Therefore, leave the requests on the waitqueue until they are actually
      completed (or cancelled).  To keep track of when aio_poll_complete_work
      needs to be scheduled, use new fields in struct poll_iocb.  Remove the
      'done' field which is now redundant.
      
      Note that this is consistent with how sys_poll() and eventpoll work;
      their wakeup functions do *not* remove the waitqueue entries.
      
      Fixes: 2c14fa83 ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.18+
      Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
  28. Oct 25, 2021
  29. Oct 20, 2021
  30. Aug 27, 2021
    • eventfd: Make signal recursion protection a task bit · b542e383
      Thomas Gleixner authored
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
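
The per-task approach can be modelled in userspace with a thread-local flag. This is a sketch under stated assumptions, not the kernel code: the `sketch_` names are hypothetical, and a C11 `_Thread_local` variable stands in for the bit in task_struct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-thread flag standing in for the per-task recursion bit. */
static _Thread_local bool sketch_in_signal;
/* Counts signals that were actually delivered on this thread. */
static _Thread_local int sketch_delivered;
/* Records what the nested call in the demo wakeup returned. */
static bool sketch_nested_result = true;

typedef void (*sketch_cb)(void);

/*
 * Deliver a signal and run an optional wakeup callback. A nested call
 * from the same thread is refused, which bounds the stack depth.
 */
static bool sketch_signal(sketch_cb wakeup)
{
	if (sketch_in_signal)
		return false;		/* recursion detected on this thread */
	sketch_in_signal = true;
	sketch_delivered++;
	if (wakeup)
		wakeup();
	sketch_in_signal = false;
	return true;
}

/* Demo wakeup that tries to signal again from inside the callback. */
static void sketch_reentrant_wakeup(void)
{
	sketch_nested_result = sketch_signal(NULL);
}
```

Because the flag belongs to the running thread rather than to the CPU, it stays correct if the thread is preempted or migrated while inside the protected section, which is the property the RT case needs.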
      
      No functional change for !RT kernels.
      
      Reported-by: Daniel Bristot de Oliveira <bristo...
  31. Apr 30, 2021
  32. Dec 15, 2020
    • mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Dmitry Safonov authored
      As the kernel expects to see only one such mapping, any further operations
      on the VMA copy may be unexpected by the kernel.  This may be erring on the
      safe side, but there doesn't seem to be any expected use case for this, so
      restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>