  1. Jun 11, 2024
  2. May 23, 2024
    • mseal: wire up mseal syscall · ff388fe5
      Jeff Xu authored
      Patch series "Introduce mseal", v10.
      
      This patchset proposes a new mseal() syscall for the Linux kernel.
      
      In a nutshell, mseal() protects the VMAs of a given virtual memory range
      against modifications, such as changes to their permission bits.
      
      Modern CPUs support memory permissions, such as the read/write (RW) and
      no-execute (NX) bits.  Linux has supported NX since the release of kernel
      version 2.6.8 in August 2004 [1].  The memory permission feature improves
      the security stance on memory corruption bugs, as an attacker cannot
      simply write to arbitrary memory and point the code to it.  The memory
      must be marked with the X bit, or else an exception will occur. 
      Internally, the kernel maintains the memory permissions in a data
      structure called VMA (vm_area_struct).  mseal() additionally protects the
      VMA itself against modifications of the selected seal type.
      
      Memory sealing is useful to mitigate memory corruption issues where a
      corrupted pointer is passed to a memory management system.  For example,
      such an attacker primitive can break control-flow integrity guarantees
      since read-only memory that is supposed to be trusted can become writable
      or .text pages can get remapped.  Memory sealing can automatically be
      applied by the runtime loader to seal .text and .rodata pages and
      applications can additionally seal security critical data at runtime.  A
      similar feature already exists in the XNU kernel with the
      VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
      [4].  Also, Chrome wants to adopt this feature for their CFI work [2] and
      this patchset has been designed to be compatible with the Chrome use case.
      
      Two system calls are involved in sealing the map:  mmap() and mseal().
      
      The new mseal() is a syscall available on 64-bit CPUs, with the
      following signature:
      
      int mseal(void *addr, size_t len, unsigned long flags)
      addr/len: memory range.
      flags: reserved.
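      
      As an illustration only (not part of the patch), a minimal usage sketch
      might look like the following; it assumes the installed kernel headers
      define __NR_mseal, since glibc does not provide a wrapper:
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      int main(void)
      {
              size_t len = 4096;
      
              /* Map an anonymous read-only page that should stay immutable. */
              void *p = mmap(NULL, len, PROT_READ,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED)
                      return 1;
      
              /* Seal the mapping; flags are reserved and must be 0. */
              if (syscall(__NR_mseal, p, len, 0) != 0) {
                      perror("mseal");
                      return 1;
              }
      
              /* Later attempts to modify the sealed VMA are expected to
               * fail with EPERM. */
              if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0)
                      perror("mprotect on sealed VMA");
      
              return 0;
      }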
      
      mseal() blocks the following operations for the given memory range:
      
      1> Unmapping, moving to another location, or shrinking the size,
         via munmap() and mremap(); these can leave an empty space that
         could later be filled by a VMA with a new set of attributes.
      
      2> Moving or expanding a different VMA into the current location,
         via mremap().
      
      3> Modifying a VMA via mmap(MAP_FIXED).
      
      4> Size expansion, via mremap(), does not appear to pose any specific
         risks to sealed VMAs. It is included anyway because the use case is
         unclear. In any case, users can rely on merging to expand a sealed VMA.
      
      5> mprotect() and pkey_mprotect().
      
      6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
         memory, when users don't have write permission to the memory. Those
         behaviors can alter region contents by discarding pages, effectively a
         memset(0) for anonymous memory.
      
      The idea that inspired this patch comes from Stephen Röttger’s work in
      V8 CFI [5].  Chrome browser in ChromeOS will be the first user of this
      API.
      
      Indeed, the Chrome browser has very specific requirements for sealing,
      which are distinct from those of most applications.  For example, in the
      case of libc, sealing is only applied to read-only (RO) or read-execute
      (RX) memory segments (such as .text and .RELRO) to prevent them from
      becoming writable; the lifetime of those mappings is tied to the
      lifetime of the process.
      
      Chrome wants to seal two large address space reservations that are managed
      by different allocators.  The memory is mapped RW- and RWX respectively
      but write access to it is restricted using pkeys (or in the future ARM
      permission overlay extensions).  The lifetime of those mappings is not
      tied to the lifetime of the process; therefore, while the memory is
      sealed, the allocators still need to free or discard the unused memory,
      for example with madvise(DONTNEED).
      
      However, always allowing madvise(DONTNEED) on this range poses a security
      risk.  For example if a jump instruction crosses a page boundary and the
      second page gets discarded, it will overwrite the target bytes with zeros
      and change the control flow.  Checking write-permission before the discard
      operation allows us to control when the operation is valid.  In this case,
      the madvise will only succeed if the executing thread has PKEY write
      permissions and PKRU changes are protected in software by control-flow
      integrity.
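      
      As a rough, hypothetical illustration of that pattern (not code from this
      patch; it assumes pkey support in the hardware, glibc's pkey_alloc()/
      pkey_mprotect()/pkey_set() wrappers, and a kernel exposing __NR_mseal):
      
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      /*
       * Reserve a large RW pool whose writability is gated by a pkey, then
       * seal it.  munmap()/mprotect() on the pool are now blocked, but
       * madvise(MADV_DONTNEED) is still possible while the calling thread
       * has write access through the pkey.
       */
      static void *reserve_sealed_pool(size_t len, int *pkey_out)
      {
              int pkey = pkey_alloc(0, 0);
              void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
              if (pkey < 0 || pool == MAP_FAILED)
                      return NULL;
      
              pkey_mprotect(pool, len, PROT_READ | PROT_WRITE, pkey);
              syscall(__NR_mseal, pool, len, 0);
      
              *pkey_out = pkey;
              return pool;
      }
      
      /* Discard an unused chunk; intended to succeed only while the pkey
       * currently grants this thread write access to the pool. */
      static int discard_chunk(void *addr, size_t len, int pkey)
      {
              pkey_set(pkey, 0);                      /* allow writes */
              int ret = madvise(addr, len, MADV_DONTNEED);
              pkey_set(pkey, PKEY_DISABLE_WRITE);     /* drop them again */
              return ret;
      }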
      
      Although the initial version of this patch series is targeting the Chrome
      browser as its first user, it became evident during upstream discussions
      that we would also want to ensure that the patch set eventually is a
      complete solution for memory sealing and compatible with other use cases. 
      The specific scenario currently in mind is glibc's use case of loading and
      sealing ELF executables.  To this end, Stephen is working on a change to
      glibc to add sealing support to the dynamic linker, which will seal all
      non-writable segments at startup.  Once this work is completed, all
      applications will be able to automatically benefit from these new
      protections.
      
      In closing, I would like to formally acknowledge the valuable
      contributions received during the RFC process, which were instrumental in
      shaping this patch:
      
      Jann Horn: raising awareness and providing valuable insights on the
        destructive madvise operations.
      Liam R. Howlett: perf optimization.
      Linus Torvalds: assisting in defining system call signature and scope.
      Theo de Raadt: sharing the experiences and insight gained from
        implementing mimmutable() in OpenBSD.
      
      MM perf benchmarks
      ==================
      This patch adds a loop in mprotect/munmap/madvise(DONTNEED) to check
      the VMAs' sealing flag, so that no partial update is made when any
      segment within the given memory range is sealed.
      
      To measure the performance impact of this loop, two tests were
      developed [8].
      
      The first measures the time taken by a particular system call using
      clock_gettime(CLOCK_MONOTONIC).  The second uses
      PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space).  Both tests
      produce similar results.
      
      The tests roughly follow the sequence below:
      for (i = 0; i < 1000; i++)
          create 1000 mappings (1 page per VMA)
          start the sampling
          for (j = 0; j < 1000; j++)
              mprotect one mapping
          stop and save the sample
          delete 1000 mappings
      compute statistics over all samples
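      
      A compressed sketch of the timing variant is shown below; the structure
      and names are illustrative only and are not the actual test code from
      [8] (one assumption here: alternating page protections are used to keep
      the 1000 single-page VMAs from merging):
      
      #include <stdio.h>
      #include <sys/mman.h>
      #include <time.h>
      
      #define NR_VMAS 1000
      #define PAGE    4096L
      
      static long long now_ns(void)
      {
              struct timespec ts;
      
              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }
      
      int main(void)
      {
              char *base = mmap(NULL, NR_VMAS * PAGE, PROT_READ,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              long long start, total;
              int i;
      
              if (base == MAP_FAILED)
                      return 1;
      
              /* Give every other page different protections so neighbouring
               * VMAs cannot merge: 1000 mappings, 1 page per VMA. */
              for (i = 0; i < NR_VMAS; i += 2)
                      mprotect(base + i * PAGE, PAGE, PROT_READ | PROT_WRITE);
      
              start = now_ns();
              for (i = 0; i < NR_VMAS; i++)
                      mprotect(base + i * PAGE, PAGE, PROT_NONE);
              total = now_ns() - start;
      
              printf("%lld ns per mprotect()\n", total / NR_VMAS);
              munmap(base, NR_VMAS * PAGE);
              return 0;
      }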
      
      The tests below were performed on a Chromebook with an Intel(R)
      Pentium(R) Gold 7505 @ 2.00GHz and 4G of memory.
      
      Based on the latest upstream code:
      The first test (measuring time)
      syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
      munmap__  	1	909	944	35	35	104%
      munmap__  	2	1398	1502	104	52	107%
      munmap__  	4	2444	2594	149	37	106%
      munmap__  	8	4029	4323	293	37	107%
      munmap__  	16	6647	6935	288	18	104%
      munmap__  	32	11811	12398	587	18	105%
      mprotect	1	439	465	26	26	106%
      mprotect	2	1659	1745	86	43	105%
      mprotect	4	3747	3889	142	36	104%
      mprotect	8	6755	6969	215	27	103%
      mprotect	16	13748	14144	396	25	103%
      mprotect	32	27827	28969	1142	36	104%
      madvise_	1	240	262	22	22	109%
      madvise_	2	366	442	76	38	121%
      madvise_	4	623	751	128	32	121%
      madvise_	8	1110	1324	215	27	119%
      madvise_	16	2127	2451	324	20	115%
      madvise_	32	4109	4642	534	17	113%
      
      The second test (measuring cpu cycle)
      syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	1790	1890	100	100	106%
      munmap__	2	2819	3033	214	107	108%
      munmap__	4	4959	5271	312	78	106%
      munmap__	8	8262	8745	483	60	106%
      munmap__	16	13099	14116	1017	64	108%
      munmap__	32	23221	24785	1565	49	107%
      mprotect	1	906	967	62	62	107%
      mprotect	2	3019	3203	184	92	106%
      mprotect	4	6149	6569	420	105	107%
      mprotect	8	9978	10524	545	68	105%
      mprotect	16	20448	21427	979	61	105%
      mprotect	32	40972	42935	1963	61	105%
      madvise_	1	434	497	63	63	115%
      madvise_	2	752	899	147	74	120%
      madvise_	4	1313	1513	200	50	115%
      madvise_	8	2271	2627	356	44	116%
      madvise_	16	4312	4883	571	36	113%
      madvise_	32	8376	9319	943	29	111%
      
      Based on these results, for the 6.8 kernel the sealing check adds
      20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
      
      In addition, I applied the sealing patch to the 5.10 kernel:
      The first test (measuring time)
      syscall__	vmas	t	tmseal	delta_ns	per_vma	%
      munmap__	1	357	390	33	33	109%
      munmap__	2	442	463	21	11	105%
      munmap__	4	614	634	20	5	103%
      munmap__	8	1017	1137	120	15	112%
      munmap__	16	1889	2153	263	16	114%
      munmap__	32	4109	4088	-21	-1	99%
      mprotect	1	235	227	-7	-7	97%
      mprotect	2	495	464	-30	-15	94%
      mprotect	4	741	764	24	6	103%
      mprotect	8	1434	1437	2	0	100%
      mprotect	16	2958	2991	33	2	101%
      mprotect	32	6431	6608	177	6	103%
      madvise_	1	191	208	16	16	109%
      madvise_	2	300	324	24	12	108%
      madvise_	4	450	473	23	6	105%
      madvise_	8	753	806	53	7	107%
      madvise_	16	1467	1592	125	8	108%
      madvise_	32	2795	3405	610	19	122%
      					
      The second test (measuring cpu cycle)
      syscall__	nbr_vma	cpu	cmseal	delta_cpu	per_vma	%
      munmap__	1	684	715	31	31	105%
      munmap__	2	861	898	38	19	104%
      munmap__	4	1183	1235	51	13	104%
      munmap__	8	1999	2045	46	6	102%
      munmap__	16	3839	3816	-23	-1	99%
      munmap__	32	7672	7887	216	7	103%
      mprotect	1	397	443	46	46	112%
      mprotect	2	738	788	50	25	107%
      mprotect	4	1221	1256	35	9	103%
      mprotect	8	2356	2429	72	9	103%
      mprotect	16	4961	4935	-26	-2	99%
      mprotect	32	9882	10172	291	9	103%
      madvise_	1	351	380	29	29	108%
      madvise_	2	565	615	49	25	109%
      madvise_	4	872	933	61	15	107%
      madvise_	8	1508	1640	132	16	109%
      madvise_	16	3078	3323	245	15	108%
      madvise_	32	5893	6704	811	25	114%
      
      For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
      CPU cycles; in some cases there is even a decrease.
      
      It might be interesting to compare the 5.10 and 6.8 kernels.
      The first test (measuring time)
      syscall__	vmas	t_5_10	t_6_8	delta_ns	per_vma	%
      munmap__	1	357	909	552	552	254%
      munmap__	2	442	1398	956	478	316%
      munmap__	4	614	2444	1830	458	398%
      munmap__	8	1017	4029	3012	377	396%
      munmap__	16	1889	6647	4758	297	352%
      munmap__	32	4109	11811	7702	241	287%
      mprotect	1	235	439	204	204	187%
      mprotect	2	495	1659	1164	582	335%
      mprotect	4	741	3747	3006	752	506%
      mprotect	8	1434	6755	5320	665	471%
      mprotect	16	2958	13748	10790	674	465%
      mprotect	32	6431	27827	21397	669	433%
      madvise_	1	191	240	49	49	125%
      madvise_	2	300	366	67	33	122%
      madvise_	4	450	623	173	43	138%
      madvise_	8	753	1110	357	45	147%
      madvise_	16	1467	2127	660	41	145%
      madvise_	32	2795	4109	1314	41	147%
      
      The second test (measuring cpu cycle)
      syscall__	vmas	cpu_5_10	c_6_8	delta_cpu	per_vma	%
      munmap__	1	684	1790	1106	1106	262%
      munmap__	2	861	2819	1958	979	327%
      munmap__	4	1183	4959	3776	944	419%
      munmap__	8	1999	8262	6263	783	413%
      munmap__	16	3839	13099	9260	579	341%
      munmap__	32	7672	23221	15549	486	303%
      mprotect	1	397	906	509	509	228%
      mprotect	2	738	3019	2281	1140	409%
      mprotect	4	1221	6149	4929	1232	504%
      mprotect	8	2356	9978	7622	953	423%
      mprotect	16	4961	20448	15487	968	412%
      mprotect	32	9882	40972	31091	972	415%
      madvise_	1	351	434	82	82	123%
      madvise_	2	565	752	186	93	133%
      madvise_	4	872	1313	442	110	151%
      madvise_	8	1508	2271	763	95	151%
      madvise_	16	3078	4312	1234	77	140%
      madvise_	32	5893	8376	2483	78	142%
      
      From 5.10 to 6.8:
      munmap: added 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
      mprotect: added 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
      madvise: added 33-50 ns in time, or 70-110 CPU cycles, per VMA.
      
      In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
      increase from 5.10 to 6.8 is significantly larger, approximately ten times
      greater for munmap and mprotect.
      
      When I discussed mm performance with Brian Makin, an engineer who has
      worked on performance, he pointed out that benchmarks such as these,
      which measure millions of mm syscalls in a tight loop, may not
      accurately reflect real-world scenarios, such as that of a database
      service.  Also, this was tested on a single piece of hardware running
      ChromeOS; the data from other hardware or distributions might differ.
      It might be best to take this data with a grain of salt.
      
      
      This patch (of 5):
      
      Wire up mseal syscall for all architectures.
      
      Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
      Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
      
      
      Signed-off-by: Jeff Xu <jeffxu@chromium.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Jann Horn <jannh@google.com> [Bug #2]
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Stephen Röttger <sroettger@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
      Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ff388fe5
  3. Mar 22, 2024
  4. Dec 14, 2023
  5. Nov 12, 2023
  6. Oct 03, 2023
  7. Sep 21, 2023
  8. Aug 02, 2023
    • x86/shstk: Introduce map_shadow_stack syscall · c35559f9
      Rick Edgecombe authored
      
      When operating with shadow stacks enabled, the kernel will automatically
      allocate shadow stacks for new threads; however, in some cases userspace
      will need additional shadow stacks. The main example of this is the
      ucontext family of functions, which requires userspace to allocate and
      pivot to userspace-managed stacks.
      
      Unlike most other user memory permissions, shadow stacks need to be
      provisioned with special data in order to be useful. They need to be set
      up with a restore token so that userspace can pivot to them via the
      RSTORSSP instruction. But, the security design of shadow stacks is that
      they should not be written to except in limited circumstances. This
      presents a problem for userspace: how can it provision this special
      data without allowing the shadow stack to be generally writable?
      
      Previously, a new PROT_SHADOW_STACK was attempted, which could be
      mprotect()ed from RW permissions after the data was provisioned. This was
      found to not be secure enough, as other threads could write to the
      shadow stack during the writable window.
      
      The kernel can use a special instruction, WRUSS, to write directly to
      userspace shadow stacks. So the solution can be that memory can be mapped
      as shadow stack permissions from the beginning (never generally writable
      in userspace), and the kernel itself can write the restore token.
      
      First, a new madvise() flag was explored, which could operate on the
      PROT_SHADOW_STACK memory. This had a couple of downsides:
      1. Extra checks were needed in mprotect() to prevent writable memory from
         ever becoming PROT_SHADOW_STACK.
      2. Extra checks/vma state were needed in the new madvise() to prevent
         restore tokens being written into the middle of pre-used shadow stacks.
         It is ideal to prevent restore tokens being added at arbitrary
         locations, so the check was to make sure the shadow stack had never been
         written to.
      3. It stood out from the rest of the madvise flags, as more of a direct
         action than a hint at future desired behavior.
      
      So rather than repurpose two existing syscalls (mmap, madvise) that don't
      quite fit, just implement a new map_shadow_stack syscall to allow
      userspace to map and set up new shadow stacks in one step. While ucontext
      is the primary motivator, userspace may have other unforeseen reasons to
      set up its own shadow stacks using the WRSS instruction. To that end,
      provide a flag so that stacks can optionally be set up securely for the
      common case of ucontext without enabling WRSS. Or potentially have the
      kernel set up the shadow stack in some new way.
      
      The following example demonstrates how to create a new shadow stack with
      map_shadow_stack:
      void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
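      
      Since glibc does not (at the time of writing) provide a wrapper, one way
      to invoke it is a thin sketch like the one below (assuming
      __NR_map_shadow_stack and SHADOW_STACK_SET_TOKEN are available from the
      installed kernel headers):
      
      #include <stdint.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      /* Let the kernel pick the address (addr == 0) and ask it to place a
       * restore token at the top, so userspace can later pivot to the new
       * stack with RSTORSSP. */
      static void *map_shadow_stack(uint64_t addr, uint64_t size,
                                    uint32_t flags)
      {
              return (void *)syscall(__NR_map_shadow_stack, addr, size, flags);
      }
      
      /* Example: void *shstk = map_shadow_stack(0, 0x20000,
       *                                         SHADOW_STACK_SET_TOKEN); */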
      
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Tested-by: Pengfei Xu <pengfei.xu@intel.com>
      Tested-by: John Allen <john.allen@amd.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
      c35559f9
  9. Jul 27, 2023
  10. Jun 09, 2023
    • cachestat: implement cachestat syscall · cf264e13
      Nhat Pham authored
      There is currently no good way to query the page cache state of large file
      sets and directory trees.  There is mincore(), but it scales poorly: the
      kernel writes out a lot of bitmap data that userspace has to aggregate,
      when the user really does not care about per-page information in that
      case.  The user also needs to mmap and unmap each file as it goes along,
      which can be quite slow as well.
      
      Some use cases where this information could come in handy:
        * Allowing a database to decide whether to perform an index scan or
          direct table queries based on the in-memory cache state of the
          index.
        * Visibility into the writeback algorithm, for diagnosing performance
          issues.
        * Workload-aware writeback pacing: estimating IO fulfilled by page
          cache (and IO to be done) within a range of a file, allowing for
          more frequent syncing when and where there is IO capacity, and
          batching when there is not.
        * Computing memory usage of large files/directory trees, analogous to
          the du tool for disk usage.
      
      More information about these use cases could be found in the following
      thread:
      
      https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
      
      This patch implements a new syscall that queries cache state of a file and
      summarizes the number of cached pages, number of dirty pages, number of
      pages marked for writeback, number of (recently) evicted pages, etc.  in a
      given range.  Currently, the syscall is only wired in for x86
      architecture.
      
      NAME
          cachestat - query the page cache statistics of a file.
      
      SYNOPSIS
          #include <sys/mman.h>
      
          struct cachestat_range {
              __u64 off;
              __u64 len;
          };
      
          struct cachestat {
              __u64 nr_cache;
              __u64 nr_dirty;
              __u64 nr_writeback;
              __u64 nr_evicted;
              __u64 nr_recently_evicted;
          };
      
          int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
              struct cachestat *cstat, unsigned int flags);
      
      DESCRIPTION
          cachestat() queries the number of cached pages, number of dirty
          pages, number of pages marked for writeback, number of evicted
          pages, number of recently evicted pages, in the bytes range given by
          `off` and `len`.
      
          An evicted page is a page that was previously in the page cache
          but has since been evicted. A page is recently evicted if its last
          eviction was recent enough that its reentry to the cache would
          indicate that it is actively being used by the system, and that
          there is memory pressure on the system.
      
          These values are returned in a cachestat struct, whose address is
          given by the `cstat` argument.
      
          The `off` and `len` arguments must be non-negative integers. If
          `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
          0, we will query in the range from `off` to the end of the file.
      
          The `flags` argument is unused for now, but is included for future
          extensibility. Users should pass 0 (i.e. no flags specified).
      
          Currently, hugetlbfs is not supported.
      
          Because the status of a page can change after cachestat() checks it
          but before it returns to the application, the returned values may
          contain stale information.
      
      RETURN VALUE
          On success, cachestat returns 0. On error, -1 is returned, and errno
          is set to indicate the error.
      
      ERRORS
          EFAULT cstat or cstat_args points to an invalid address.
      
          EINVAL invalid flags.
      
          EBADF  invalid file descriptor.
      
          EOPNOTSUPP file descriptor is of a hugetlbfs file
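      
      A small usage sketch (illustrative only; it calls the syscall directly
      and assumes __NR_cachestat is defined by the installed headers, with the
      struct layouts copied from the synopsis above):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <linux/types.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      struct cachestat_range {
              __u64 off;
              __u64 len;
      };
      
      struct cachestat {
              __u64 nr_cache;
              __u64 nr_dirty;
              __u64 nr_writeback;
              __u64 nr_evicted;
              __u64 nr_recently_evicted;
      };
      
      int main(int argc, char **argv)
      {
              /* len == 0 means "from off to the end of the file". */
              struct cachestat_range range = { .off = 0, .len = 0 };
              struct cachestat cs;
              int fd;
      
              if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                      return 1;
      
              if (syscall(__NR_cachestat, fd, &range, &cs, 0) != 0) {
                      perror("cachestat");
                      return 1;
              }
      
              printf("cached %llu dirty %llu writeback %llu\n",
                     (unsigned long long)cs.nr_cache,
                     (unsigned long long)cs.nr_dirty,
                     (unsigned long long)cs.nr_writeback);
              close(fd);
              return 0;
      }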
      
      [nphamcs@gmail.com: replace rounddown logic with the existing helper]
        Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cf264e13
  11. Jan 15, 2022
  12. Oct 07, 2021
  13. Sep 08, 2021
    • compat: remove some compat entry points · 59ab844e
      Arnd Bergmann authored
      These are all handled correctly when calling the native system call entry
      point, so remove the special cases.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
      
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ab844e
  14. Sep 03, 2021
  15. Jul 08, 2021
  16. Jun 07, 2021
  17. May 17, 2021
    • quota: Disable quotactl_path syscall · 5b9fedb3
      Jan Kara authored
      In commit fa8b9007 ("quota: wire up quotactl_path") we wired up the
      new quotactl_path syscall. However, some people in the LWN discussion
      objected that the path-based syscall is missing the dirfd and flags
      arguments that are mostly standard for contemporary path-based syscalls.
      Indeed they have a point, and after a discussion with Christian Brauner
      and Sascha Hauer I've decided to disable the syscall for now and update
      its API. Since there is no userspace currently using the syscall and it
      hasn't shipped in any major release, we should be fine.
      
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      CC: Sascha Hauer <s.hauer@pengutronix.de>
      Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein
      
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      5b9fedb3
  18. Apr 22, 2021
  19. Mar 17, 2021
  20. Jan 24, 2021
    • fs: add mount_setattr() · 2a186721
      Christian Brauner authored
      This implements the missing mount_setattr() syscall. While the new mount
      api allows changing the properties of a superblock, there is currently
      no way to change the properties of a mount or a mount tree using the file
      descriptors that the new mount api is based on. In addition, the old
      mount api has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows recursively clearing and setting
      mount options in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
      
      The @attr_set and @attr_clr members are used to clear and set mount
      options. This way a user can e.g. request that a set of flags is to be
      raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
      
      The @userns_fd field lets the user specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
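      
      As a rough usage sketch (illustrative; "/mnt" is a placeholder path,
      struct mount_attr and the MOUNT_ATTR_* constants are assumed to come
      from linux/mount.h, __NR_mount_setattr from the installed headers, and
      the raw syscall is used in case the libc has no wrapper yet):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <linux/mount.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      #ifndef AT_RECURSIVE
      #define AT_RECURSIVE 0x8000     /* apply to the entire subtree */
      #endif
      
      int main(void)
      {
              struct mount_attr attr = {
                      .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC,
                      .attr_clr = 0,
                      .propagation = 0,       /* leave propagation unchanged */
                      .userns_fd = 0,
              };
      
              /* Make /mnt and every mount below it read-only and noexec. */
              if (syscall(__NR_mount_setattr, -1, "/mnt", AT_RECURSIVE,
                          &attr, sizeof(attr)) != 0) {
                      perror("mount_setattr");
                      return 1;
              }
              return 0;
      }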
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      2a186721
  21. Dec 19, 2020
  22. Oct 18, 2020
    • mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim authored
      There is a use case where System Management Software (SMS) wants to give
      a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses the pidfd of an external process to give the
      hint.  It also supports vectored address ranges, because an Android app
      has thousands of VMAs due to zygote, so it would be a total waste of CPU
      and power to call the syscall one by one for each VMA.  (In testing, a
      1-vector syscall showed a 15% performance improvement over 2000 per-VMA
      syscalls.  I think it would be bigger in real practice because the test
      ran in a very cache-friendly environment.)
      
      Another potential use case for the vector range is to amortize the cost
      of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this
      could benefit users like TCP receive zerocopy and malloc implementations.
      In the future, we could find more use cases for other advice values, so
      let's make this part of the API now that we are introducing a new
      syscall.  With that, existing madvise(2) users could replace it with
      process_madvise(2) with their own pid if they want batched address-range
      support.
      
      Since it could affect another process's address range, only a privileged
      process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
      to ptrace the target process (e.g., being the same UID), can use it
      successfully.  The flag argument is reserved for future use if we need to
      extend the API.
      
      I think supporting every hint that madvise(2) has supported or will
      support in process_madvise is rather risky.  We are not sure all hints
      make sense coming from an external process, and the implementation of a
      hint may rely on the caller being in the current context, so it could be
      error-prone.  Thus, I have limited the hints to MADV_[COLD|PAGEOUT] in
      this patch.
      
      If someone wants to add other hints, we can hear the use case and review
      each hint individually.  That is safer for maintenance than introducing a
      syscall that is buggy and hard to fix later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or
            directions to the kernel about the address ranges of an external
            process as well as the local process. It provides the advice for
            the address ranges described by iovec and vlen. The goal of such
            advice is to improve system or application performance.
      
            The pidfd selects the process referred to by the PID file
            descriptor specified in pidfd. (See pidfd_open(2) for further
            information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address (iov_base)
            and with size in bytes (iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to an external process is governed by
            a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see
            ptrace(2).
      
            process_madvise() supports every advice madvise(2) has if the
            target process is in the same thread group as the calling process,
            so users could use process_madvise(2) as an extension of existing
            madvise(2) that supports vectored address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
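      
      A rough sketch of the external-hinting flow (illustrative only; the
      target pid and the address ranges are placeholders, and the syscall
      numbers SYS_pidfd_open/SYS_process_madvise are assumed to be exposed by
      the installed headers):
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <sys/uio.h>
      #include <unistd.h>
      
      #ifndef MADV_COLD
      #define MADV_COLD 20    /* from the uapi headers */
      #endif
      
      /* Hint that two ranges in the target process are cold.  A real manager
       * would obtain the ranges from /proc/<pid>/maps or from the target. */
      static int hint_cold(pid_t target_pid)
      {
              struct iovec ranges[2] = {
                      { .iov_base = (void *)0x7f0000000000, .iov_len = 1 << 20 },
                      { .iov_base = (void *)0x7f0000200000, .iov_len = 1 << 20 },
              };
              int pidfd = syscall(SYS_pidfd_open, target_pid, 0);
              ssize_t advised;
      
              if (pidfd < 0)
                      return -1;
      
              advised = syscall(SYS_process_madvise, pidfd, ranges, 2,
                                MADV_COLD, 0);
              if (advised < 0)
                      perror("process_madvise");
              else
                      printf("advised %zd bytes\n", advised);
      
              close(pidfd);
              return advised < 0 ? -1 : 0;
      }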
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How do we handle the race (i.e., object validation) between when
      the external process decides on a hint and when the hint is applied to
      the target process?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the
      target process can run between the time the calling process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The
      same applies to other APIs like move_pages and process_vm_write.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization by using other API or
      design from userside.  It shouldn't be part of API itself.  If someone
      needs more fine-grained synchronization rather than process level,
      there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
      applicable via using last reserved argument of the API but I don't
      think it's necessary right now since we have already ways to prevent
      the race so don't want to add additional complexity with more
      fine-grained optimization model.
      
      To make the API extensible, an unsigned long is reserved as the last
      argument so we can support this in the future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act on the hint in the caller's context,
      not the callee's, because the callee is usually limited by cpuset/cgroups
      or even in a frozen state, so it can't act on its own quickly enough,
      which causes more thrashing/killing.  It also doesn't work if the target
      process is ptraced (e.g., strace, debugger, minidump) because a process
      can have at most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
  23. Oct 14, 2020
  24. Oct 03, 2020
  25. Aug 14, 2020
    • all arch: remove system call sys_sysctl · 88db0aa2
      Xiaoming Ni authored
      Since commit 61a47c1a ("sysctl: Remove the sysctl system call"),
      sys_sysctl is actually unavailable: any input can only return an error.
      
      We have been warning about people using the sysctl system call for years
      and believe there are no more users.  Even if there are users of this
      interface, if they have not complained or fixed their code by now they
      probably are not going to, so there is no point in warning them any
      longer.
      
      So completely remove sys_sysctl on all architectures.
      
      [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
       Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com
      
      
      
      Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Will Deacon <will@kernel.org>		[arm/arm64]
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bin Meng <bin.meng@windriver.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: chenzefeng <chenzefeng2@huawei.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kars de Jong <jongk@linux-m68k.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Paul Burton <paulburton@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Sven Schnelle <svens@stackframe.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
      Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88db0aa2
  26. Jul 19, 2020
    • net: remove compat_sys_{get,set}sockopt · 55db9c0e
      Christoph Hellwig authored
      
      Now that the ->compat_{get,set}sockopt proto_ops methods are gone
      there is no good reason left to keep the compat syscalls separate.
      
      This fixes the odd use of unsigned int for the compat_setsockopt
      optlen and the missing sock_use_custom_sol_socket.
      
      It would also easily allow running the eBPF hooks for the compat
      syscalls, but such a large change in behavior does not belong in
      a consolidation patch like this one.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55db9c0e
  27. Jun 16, 2020
    • arch: wire-up close_range() · 9b4feb63
      Christian Brauner authored
      
      This wires up the close_range() syscall into all arches at once.
      
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      9b4feb63
  28. May 14, 2020
    • vfs: add faccessat2 syscall · c8ffd8bc
      Miklos Szeredi authored
      
      POSIX defines faccessat() as having a fourth "flags" argument, while the
      linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
      AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.
      
      Add a new faccessat2(2) syscall with the added flags argument and
      implement both flags.
      
      The value of AT_EACCESS is defined in glibc headers to be the same as
      AT_REMOVEDIR.  Use this value for the kernel interface as well, together
      with the explanatory comment.
      
      Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
      be useful and is trivial to implement.
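      
      A minimal sketch of calling it (illustrative; it goes through syscall(2)
      and assumes __NR_faccessat2 is provided by the installed headers):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      int main(void)
      {
              /* Check write access to /etc/passwd using the *effective* IDs,
               * without relying on glibc's AT_EACCESS emulation. */
              long ret = syscall(__NR_faccessat2, AT_FDCWD, "/etc/passwd",
                                 W_OK, AT_EACCESS);
      
              printf("faccessat2: %s\n", ret == 0 ? "writable" : "not writable");
              return 0;
      }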
      
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      c8ffd8bc
  29. Mar 21, 2020
  30. Jan 18, 2020
    • open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai authored
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
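      
      As an illustrative sketch of the RESOLVE_IN_ROOT case (the directory
      path is a placeholder; struct open_how and the RESOLVE_* flags are
      assumed to come from linux/openat2.h, and __NR_openat2 from the
      installed headers, since glibc has no wrapper):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <linux/openat2.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      int main(void)
      {
              struct open_how how;
              int dirfd, fd;
      
              dirfd = open("/srv/rootfs", O_PATH | O_DIRECTORY);
              if (dirfd < 0)
                      return 1;
      
              /* Zero the struct so future extensions default to no-ops. */
              memset(&how, 0, sizeof(how));
              how.flags = O_RDONLY;
              how.resolve = RESOLVE_IN_ROOT;  /* ".." and absolute symlinks
                                                 cannot escape dirfd */
      
              fd = syscall(__NR_openat2, dirfd, "etc/passwd", &how, sizeof(how));
              if (fd < 0) {
                      perror("openat2");
                      return 1;
              }
      
              close(fd);
              close(dirfd);
              return 0;
      }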
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs
      
      
      
      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  31. Jan 13, 2020
  32. Jun 28, 2019
    • arch: wire-up pidfd_open() · 7615d9e1
      Christian Brauner authored
      
      This wires up the pidfd_open() syscall into all arches at once.
      
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      7615d9e1
  33. Jun 09, 2019
    • arch: wire-up clone3() syscall · 8f3220a8
      Christian Brauner authored
      
      Wire up the clone3() call on all arches that don't require hand-rolled
      assembly.
      
      Some of the arches look like they need special assembly massaging and it is
      probably smarter if the appropriate arch maintainers would do the actual
      wiring. Arches that are wired-up are:
      - x86{_32,64}
      - arm{64}
      - xtensa
      
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
      8f3220a8
  34. May 16, 2019