  1. Jul 05, 2024
  2. May 22, 2024
    • kernel: Remove signal hacks for vhost_tasks · 240a1853
      Mike Christie authored
      This removes the signal/coredump hacks added for vhost_tasks in:
      
      Commit f9010dbd ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
      
      When that patch was added vhost_tasks did not handle SIGKILL and would
      try to ignore/clear the signal and continue on until the device's close
      function was called. In the previous patches vhost_tasks and the vhost
      drivers were converted to support SIGKILL by cleaning themselves up and
      exiting. The hacks are no longer needed so this removes them.
      
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Message-Id: <20240316004707.45557-10-michael.christie@oracle.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  3. May 08, 2024
    • fs/coredump: Enable dynamic configuration of max file note size · 4bbf9c3b
      Allen Pais authored
      
      Introduce the capability to dynamically configure the maximum file
      note size for ELF core dumps via sysctl.
      
      Why is this being done?
      We have observed that during a crash when there are more than 65k mmaps
      in memory, the existing fixed limit on the size of the ELF notes section
      becomes a bottleneck. The notes section quickly reaches its capacity,
      leading to incomplete memory segment information in the resulting coredump.
      This truncation compromises the utility of the coredumps, as crucial
      information about the memory state at the time of the crash might be
      omitted.
      
      This enhancement removes the previous static limit of 4MB, allowing
      system administrators to adjust the size based on system-specific
      requirements or constraints.
      
      E.g.:
      $ sysctl -a | grep core_file_note_size_limit
      kernel.core_file_note_size_limit = 4194304
      
      $ sysctl -n kernel.core_file_note_size_limit
      4194304
      
      $ echo 519304 > /proc/sys/kernel/core_file_note_size_limit

      $ sysctl -n kernel.core_file_note_size_limit
      519304
      
      Attempting to write beyond the ceiling value of 16MB:
      $ echo 17194304 > /proc/sys/kernel/core_file_note_size_limit
      bash: echo: write error: Invalid argument
      
      Signed-off-by: Vijay Nag <nagvijay@microsoft.com>
      Signed-off-by: Allen Pais <apais@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
  4. Mar 06, 2024
    • iov_iter: get rid of 'copy_mc' flag · a50026bd
      Linus Torvalds authored
      This flag is only set by one single user: the magical core dumping code
      that looks up user pages one by one, and then writes them out using
      their kernel addresses (by using a BVEC_ITER).
      
      That actually ends up being a huge problem, because while we do use
      copy_mc_to_kernel() for this case and it is able to handle the possible
      machine checks involved, nothing else is really ready to handle the
      failures caused by the machine check.
      
      In particular, as reported by Tong Tiangen, we don't actually support
      fault_in_iov_iter_readable() on a machine check area.
      
      As a result, the usual logic for writing things to a file under a
      filesystem lock, which involves doing a copy with page faults disabled
      and then if that fails trying to fault pages in without holding the
      locks with fault_in_iov_iter_readable() does not work at all.
      
      We could decide to always just make the MC copy "succeed" (and filling
      the destination with zeroes), and that would then create a core dump
      file that just ignores any machine checks.
      
      But honestly, this single special case has been problematic before, and
      means that all the normal iov_iter code ends up slightly more complex
      and slower.
      
      See for example commit c9eec08b ("iov_iter: Don't deal with
      iter->copy_mc in memcpy_from_iter_mc()") where David Howells
      re-organized the code just to avoid having to check the 'copy_mc' flags
      inside the inner iov_iter loops.
      
      So considering that we have exactly one user, and that one user is a
      non-critical special case that doesn't actually ever trigger in real
      life (Tong found this with manual error injection), the sane solution is
      to just decide that the onus of handling the machine check lies on that
      user instead.
      
      Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying
      the user data to a stable kernel page before writing it out.
      
      Fixes: f1982740 ("iov_iter: Convert iterate*() to inline funcs")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
      Link: https://lore.kernel.org/r/20240305133336.3804360-1-tongtiangen@huawei.com
      Link: https://lore.kernel.org/all/4e80924d-9c85-f13a-722a-6a5d2b1c225a@huawei.com/
      Tested-by: David Howells <dhowells@redhat.com>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reported-by: Tong Tiangen <tongtiangen@huawei.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  5. Dec 28, 2023
  6. Jun 01, 2023
    • fork, vhost: Use CLONE_THREAD to fix freezer/ps regression · f9010dbd
      Mike Christie authored
      When switching from kthreads to vhost_tasks two bugs were added:
      1. The vhost worker tasks now show up as processes, so scripts doing
      ps or ps a would now incorrectly detect the vhost task as another
      process.
      2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost
      tasks didn't disable it or add support for freezing.
      
      To fix both bugs, this switches the vhost task to be a thread in the
      process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
      get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
      SIGKILL/STOP support is required because CLONE_THREAD requires
      CLONE_SIGHAND, which requires those 2 signals to be supported.
      
      This is a modified version of the patch written by Mike Christie
      <michael.christie@oracle.com>, which was a modified version of a patch
      originally written by Linus.
      
      Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
      Including ignoring signals, setting up the register state, and having
      get_signal return instead of calling do_group_exit.
      
      Tidied up the vhost_task abstraction so that the definition of
      vhost_task only needs to be visible inside of vhost_task.c, making
      it easier to review the code and tell what needs to be done where.
      As part of this the main loop has been moved from vhost_worker into
      vhost_task_fn.  vhost_worker now returns true if work was done.
      
      The main loop has been updated to call get_signal which handles
      SIGSTOP, freezing, and collects the message that tells the thread to
      exit as part of process exit.  This collection clears
      __fatal_signal_pending.  This collection is not guaranteed to
      clear signal_pending() so clear that explicitly so the schedule()
      sleeps.
      
      For now the vhost thread continues to exist and run work until the
      last file descriptor is closed and the release function is called as
      part of freeing struct file.  To avoid hangs in the coredump
      rendezvous and when killing threads in a multi-threaded exec, the
      coredump code and de_thread have been modified to ignore vhost threads.
      
      Removing the special case for exec appears to require teaching
      vhost_dev_flush how to directly complete transactions in case
      the vhost thread is no longer running.
      
      Removing the special case for coredump rendezvous requires either the
      above fix needed for exec or moving the coredump rendezvous into
      get_signal.
      
      Fixes: 6e890c5d ("vhost: use vhost_tasks for worker threads")
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Co-developed-by: Mike Christie <michael.christie@oracle.com>
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. May 17, 2023
    • coredump: require O_WRONLY instead of O_RDWR · 88e46070
      Vladimir Sementsov-Ogievskiy authored
      The motivation for this patch has been to enable using a stricter
      apparmor profile to prevent programs from reading any coredump in the
      system.
      
      However, this became something else. The following details are based on
      Christian's and Linus' archeology into the history of the number "2" in
      the coredump handling code.
      
      To make sure we're not accidentally introducing some subtle behavioral
      change into the coredump code we set out on a voyage into the depths of
      history.git to figure out why this was O_RDWR in the first place.
      
      Coredump handling was introduced over 30 years ago in commit
      ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)").
      The original code used O_WRONLY:
      
          open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)
      
      However, this changed in 1993 and starting with commit
      9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code
      suddenly used the constant "2":
      
          open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)
      
      This was curious as in the same commit the kernel switched from
      constants to proper defines in other places such as KERNEL_DS and
      USER_DS and O_RDWR did already exist.
      
      So why was "2" used? It turns out that open_namei() - an early version
      of what later turned into filp_open() - didn't accept O_RDWR.
      
      A semantic quirk of the open() uapi is the definition of the O_RDONLY
      flag. It would seem natural to define:
      
          #define O_RDWR (O_RDONLY | O_WRONLY)
      
      but that isn't possible because:
      
          #define O_RDONLY 0
      
      This makes O_RDONLY effectively meaningless when passed to the kernel.
      In other words, there has never been a way - until O_PATH at least - to
      open a file without any permission; O_RDONLY was always implied on the
      uapi side while the kernel does in fact allow opening files without
      permissions.
      
      The trouble comes when trying to map the uapi flags onto the
      corresponding file mode flags FMODE_{READ,WRITE}. This mapping still
      happens today and is causing issues to this day (We ran into this
      during additions for openat2() for example.).
      
      So the special value "3" was used to indicate that the file was opened
      for special access:
      
          f->f_flags = flag = flags;
          f->f_mode = (flag+1) & O_ACCMODE;
          if (f->f_mode)
                  flag++;
      
      This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping
      the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The
      special access then required read-write permissions and 0 was used to
      access symlinks.
      
      But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added
      coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an
      argument. So the coredump handling introduced in
      ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because
      O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but
      open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ by
      accident.
      
      So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix
      for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which
      was correct.
      
      The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2"
      value was passed.
      
      Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code
      was changed to:
      
          -       dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
          ...
          +       file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
      
      At this point the raw "2" should have become O_WRONLY again as
      filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.
      
      Another 17 years later, the code was changed again, cementing the mistake
      and making it almost impossible to detect, when commit 378c6520
      ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories")
      replaced the raw "2" with O_RDWR.
      
      And now, here we are with this patch that sent us on a quest to answer
      the big questions in life such as "Why are coredump files opened with
      O_RDWR?" and "Is it safe to just use O_WRONLY?".
      
      So with this commit we're reintroducing O_WRONLY again and bringing this
      code back to its original state when it was first introduced in commit
      ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.
      
      Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
      Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru>
      [brauner@kernel.org: completely rewritten commit message]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  8. May 02, 2023
    • mm: hwpoison: coredump: support recovery from dump_user_range() · 245f0922
      Kefeng Wang authored
      dump_user_range() is used to copy a user page to a coredump file, but if
      a hardware memory error occurs during the copy, which is called from
      __kernel_write_iter() in dump_user_range(), the kernel crashes:
      
        CPU: 112 PID: 7014 Comm: mca-recover Not tainted 6.3.0-rc2 #425
      
        pc : __memcpy+0x110/0x260
        lr : _copy_from_iter+0x3bc/0x4c8
        ...
        Call trace:
         __memcpy+0x110/0x260
         copy_page_from_iter+0xcc/0x130
         pipe_write+0x164/0x6d8
         __kernel_write_iter+0x9c/0x210
         dump_user_range+0xc8/0x1d8
         elf_core_dump+0x308/0x368
         do_coredump+0x2e8/0xa40
         get_signal+0x59c/0x788
         do_signal+0x118/0x1f8
         do_notify_resume+0xf0/0x280
         el0_da+0x130/0x138
         el0t_64_sync_handler+0x68/0xc0
         el0t_64_sync+0x188/0x190
      
      Generally, the '->write_iter' of file ops will use copy_page_from_iter()
      and copy_page_from_iter_atomic().  Change memcpy() to copy_mc_to_kernel()
      in both of them to handle #MC during the source read, which stops coredump
      processing and kills the task instead of panicking the kernel.  But the
      source address may not always be a user address, so introduce a new
      copy_mc flag in struct iov_iter{} to indicate that the iter can do a safe
      memory copy, and also introduce helpers to set/check the flag.  For now
      it is only used in coredump's dump_user_range(), but it could be expanded
      to any other scenarios to fix similar issues.
      
      Link: https://lkml.kernel.org/r/20230417045323.11054-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Tong Tiangen <tongtiangen@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. Feb 09, 2023
  10. Feb 03, 2023
  11. Jan 19, 2023
    • fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap · e67fe633
      Christian Brauner authored
      Convert to struct mnt_idmap.
      Remove legacy file_mnt_user_ns() and mnt_user_ns().
      
      Last cycle we merged the necessary infrastructure in commit
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source of
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  12. Jan 18, 2023
    • fs: port vfs_*() helpers to struct mnt_idmap · abf08576
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in commit
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source of
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  13. Jan 10, 2023
  14. Nov 25, 2022
    • use less confusing names for iov_iter direction initializers · de4eda9d
      Al Viro authored
      
      READ/WRITE proved to be actively confusing - the meanings are
      "data destination, as used with read(2)" and "data source, as
      used with write(2)", but people keep interpreting those as
      "we read data from it" and "we write data to it", i.e. exactly
      the wrong way.
      
      Call them ITER_DEST and ITER_SOURCE - at least that is harder
      to misinterpret...
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. Nov 18, 2022
    • core_pattern: add CPU specifier · 8603b6f5
      Oleksandr Natalenko authored
      Statistically, in a large deployment regular segfaults may indicate a CPU
      issue.
      
      Currently, it is not possible to find out what CPU the segfault happened
      on.  There are at least two attempts to improve segfault logging in this
      regard, but they do not help in case the logs rotate.
      
      Hence, let's make sure it is possible to permanently record the CPU the
      task ran on using a new core_pattern specifier.
      
      Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com
      Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Suggested-by: Renaud Métrich <rmetrich@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Grzegorz Halat <ghalat@redhat.com>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Stephen Kitt <steve@sk2.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  16. Nov 01, 2022
    • coredump: Proactively round up to kmalloc bucket size · 6dd142d9
      Kees Cook authored
      
      Instead of discovering the kmalloc bucket size _after_ allocation, round
      up proactively so the allocation is explicitly made for the full size,
      allowing the compiler to correctly reason about the resulting size of
      the buffer through the existing __alloc_size() hint.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  17. Oct 26, 2022
  18. Oct 23, 2022
  19. Oct 03, 2022
  20. Sep 28, 2022
    • [coredump] don't use __kernel_write() on kmap_local_page() · 06bbaa6d
      Al Viro authored
      passing kmap_local_page() result to __kernel_write() is unsafe -
      a random ->write_iter() might (and the 9p one does) get unhappy when
      passed ITER_KVEC with a pointer that came from kmap_local_page().
      
      Fix by providing a variant of __kernel_write() that takes an iov_iter
      from caller (__kernel_write() becomes a trivial wrapper) and adding
      dump_emit_page() that parallels dump_emit(), except that instead of
      __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source.
      
      Fixes: 3159ed57 ("fs/coredump: use kmap_local_page()")
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. Sep 26, 2022
  22. Sep 07, 2022
    • freezer,sched: Rewrite core freezer logic · f5d39b02
      Peter Zijlstra authored
      Rewrite the core freezer to behave better wrt thawing and be simpler
      in general.
      
      By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
      ensured frozen tasks stay frozen until thawed and don't randomly wake
      up early, as is currently possible.
      
      As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
      two PF_flags (yay!).
      
      Specifically, the current scheme works a little like:
      
      	freezer_do_not_count();
      	schedule();
      	freezer_count();
      
      And either the task is blocked, or it lands in try_to_freeze()
      through freezer_count(). Now, when it is blocked, the freezer
      considers it frozen and continues.
      
      However, on thawing, once pm_freezing is cleared, freezer_count()
      stops working, and any random/spurious wakeup will let a task run
      before its time.
      
      That is, thawing tries to thaw things in explicit order; kernel
      threads and workqueues before bringing SMP back before userspace
      etc.  However, due to the above mentioned races it is entirely possible…
    • sched: Add TASK_ANY for wait_task_inactive() · f9fc8cad
      Peter Zijlstra authored
      
      Now that wait_task_inactive()'s @match_state argument is a mask (like
      ttwu()) it is possible to replace the special !match_state case with
      an 'all-states' value such that any blocked state will match.
      
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
  23. Jul 20, 2022
    • signal: Drop signals received after a fatal signal has been processed · 9a95f78e
      Eric W. Biederman authored
      In 403bad72 ("coredump: only SIGKILL should interrupt the
      coredumping task") Oleg modified the kernel to drop all signals that
      come in during a coredump except SIGKILL, and suggested that it might
      be a good idea to generalize that to other cases after the process has
      received a fatal signal.
      
      Semantically it does not make sense to perform any signal delivery
      after the process has already been killed.
      
      When a signal is sent while a process is dying today the signal is
      placed in the signal queue by __send_signal and a single task of the
      process is woken up with signal_wake_up, if there are any tasks that
      have not set PF_EXITING.
      
      Take things one step further and have prepare_signal report that all
      signals that come after a process has been killed should be ignored,
      while retaining the historical exception of allowing SIGKILL to
      interrupt coredumps.
      
      Update the comment in fs/coredump.c to make it clear coredumps are
      special in being able to receive SIGKILL.
      
      This changes things so that a process stopped in PTRACE_EVENT_EXIT can
      not be made to escape its ptracer and finish exiting by sending it
      SIGKILL.  That a process can be made to leave PTRACE_EVENT_EXIT and
      escape its tracer by sending the process a SIGKILL has been
      complicating tracers for no apparent advantage.  If the process needs
      to be made to leave PTRACE_EVENT_EXIT all that needs to happen is to
      kill the process's tracer.  This differs from the coredump code where
      there is no other mechanism besides honoring SIGKILL to expedite the
      end of coredumping.
      
      Link: https://lkml.kernel.org/r/875yksd4s9.fsf_-_@email.froward.int.ebiederm.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  24. Jul 16, 2022
  25. Mar 10, 2022
  26. Mar 08, 2022
  27. Mar 01, 2022
  28. Jan 22, 2022
  29. Jan 08, 2022
    • signal: Remove the helper signal_group_exit · 49697335
      Eric W. Biederman authored
      This helper is misleading.  It tests for an ongoing exec as well as
      the process having received a fatal signal.
      
      Sometimes it is appropriate to treat an on-going exec differently than
      a process that is shutting down due to a fatal signal.  In particular,
      taking the fast path out of exit_signals instead of retargeting
      signals is not appropriate during exec, nor is changing the exit
      code in do_group_exit during exec.
      
      Removing the helper makes it more obvious what is going on as both
      cases must be coded for explicitly.
      
      While removing the helper, fix the two cases where I have observed
      that using signal_group_exit produced the wrong result.
      
      In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
      retargeted during an exec.
      
      In do_group_exit use 0 as the exit code during an exec as de_thread
      does not set group_exit_code.  As best as I can determine
      group_exit_code is set to 0 most of the time during de_thread.
      During a thread group stop group_exit_code is set to the stop signal
      and when the thread group receives SIGCONT group_exit_code is reset
      to 0.
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • coredump: Stop setting signal->group_exit_task · 6ac79ec5
      Eric W. Biederman authored
      Currently the coredump code sets group_exit_task so that
      signal_group_exit() will return true during a coredump.  Now that the
      coredump code always sets SIGNAL_GROUP_EXIT there is no longer a need
      to set signal->group_exit_task.
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-6-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Remove SIGNAL_GROUP_COREDUMP · 2f824d4d
      Eric W. Biederman authored
      After the previous cleanups "signal->core_state" is set whenever
      SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
      whenever the code wants to know if a coredump is in progress.  The
      remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
      SIGNAL_GROUP_EXIT is set.  Similarly the only place that sets
      SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.
      
      Which makes SIGNAL_GROUP_COREDUMP unnecessary and redundant. So stop
      setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
      remove its definition.
      
      With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
      longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
      by setting SIGNAL_GROUP_EXIT.
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-5-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process · 752dc970
      Eric W. Biederman authored
      There are only a few places that test SIGNAL_GROUP_EXIT and
      are not also already testing SIGNAL_GROUP_COREDUMP.
      
      This will not affect the callers of signal_group_exit as zap_process
      also sets group_exit_task so signal_group_exit will continue to return
      true at the same times.
      
      This does not affect wait_task_zombie as none of the threads
      wind up in EXIT_ZOMBIE state during a coredump.
      
      This does not affect oom_kill.c:__task_will_free_mem as
      sig->core_state is tested and handled before SIGNAL_GROUP_EXIT is
      tested for.
      
      This does not affect complete_signal as signal->core_state is tested
      for to ensure the coredump case is handled appropriately.
      
      Link: https://lkml.kernel.org/r/20211213225350.27481-4-ebiederm@xmission.com
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  30. Oct 08, 2021
    • coredump: Limit coredumps to a single thread group · 0258b5fd
      Eric W. Biederman authored
      
      Today when a signal is delivered with a handler of SIG_DFL whose
      default behavior is to generate a core dump not only that process but
      every process that shares the mm is killed.
      
      In the case of vfork this looks like a real-world problem.  Consider
      the following well-defined sequence.
      
      	if (vfork() == 0) {
      		execve(...);
      		_exit(EXIT_FAILURE);
      	}
      
      If a signal that generates a core dump is received after vfork but
      before the execve changes the mm the process that called vfork will
      also be killed (as the mm is shared).
      
      Similarly if the execve fails after the point of no return the kernel
      delivers SIGSEGV which will kill both the exec'ing process and because
      the mm is shared the process that called vfork as well.
      
      As far as I can tell this behavior violates people's reasonable
      expectations and POSIX, and is unnecessarily fragile when the
      system is low on memory.
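      The vfork()+execve() pattern in question can be exercised from user
      space with a minimal standalone sketch (not part of the commit;
      /bin/true is assumed to exist):

      ```c
      /* Minimal user-space sketch of the vfork()+execve() sequence above.
       * Illustrative only; /bin/true is assumed to exist. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      /* Runs the child sequence and returns the child's exit status. */
      static int run_vfork_child(void)
      {
      	pid_t pid = vfork();
      	if (pid == 0) {
      		/* Until execve() succeeds the child shares the parent's
      		 * mm, which is why a coredump-generating signal in this
      		 * window used to kill the parent as well. */
      		execl("/bin/true", "true", (char *)NULL);
      		_exit(EXIT_FAILURE);	/* reached only if exec failed */
      	}
      	int status;
      	waitpid(pid, &status, 0);
      	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
      }

      int main(void)
      {
      	printf("child exit status: %d\n", run_vfork_child());
      	return 0;
      }
      ```

      With this commit, a coredump taken in the window between vfork() and
      execve() kills only the dumping thread group, not the vfork parent.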
      
      Solve this by making a userspace visible change to only kill a single
      process/thread group.  This is possible because Jann Horn recently
      modified[1] the coredump code so that the mm can safely be modified
      while the coredump is happening.  With LinuxThreads long gone I don't
      expect anyone to notice this behavior change in practice.
      
      To accomplish this move the core_state pointer from mm_struct to
      signal_struct, which allows different thread groups to coredump
      simultaneously.
      
      In zap_threads remove the work to kill anything except for the current
      thread group.
      
      v2: Remove core_state from the VM_BUG_ON_MM print to fix
          compile failure when CONFIG_DEBUG_VM is enabled.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1] a07279c9 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
      Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
      History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
      Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
      
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      0258b5fd
  31. Oct 06, 2021
    • coredump: Don't perform any cleanups before dumping core · 92307383
      Eric W. Biederman authored
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm remove code to
      test for a coredump in progress from mm_release.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      92307383
    • exit: Factor coredump_exit_mm out of exit_mm · d67e03e3
      Eric W. Biederman authored
      Separate the coredump logic from the ordinary exit_mm logic
      by moving the coredump logic out of exit_mm into its own
      function, coredump_exit_mm.
      
      Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133
      
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      d67e03e3
  32. Sep 08, 2021