  1. Sep 09, 2023
    • iov_iter: Kunit tests for page extraction · a3c57ab7
      David Howells authored
      
      Add some kunit tests for page extraction for ITER_BVEC, ITER_KVEC and
      ITER_XARRAY type iterators.  ITER_UBUF and ITER_IOVEC aren't dealt with
      as they require userspace VM interaction.  ITER_DISCARD isn't dealt with
      either as that can't be extracted.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3c57ab7
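
      A minimal KUnit sketch of the kind of ITER_KVEC extraction test described
      above (this is not the in-tree lib/kunit_iov_iter.c code; the test name
      and buffer size are illustrative assumptions):

        static void iov_kvec_extract_sketch(struct kunit *test)
        {
                struct iov_iter iter;
                struct page **pages = NULL;
                struct kvec kvec[1];
                size_t offset0;
                ssize_t len;
                void *buf = kunit_kzalloc(test, PAGE_SIZE, GFP_KERNEL);

                KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
                kvec[0].iov_base = buf;
                kvec[0].iov_len  = PAGE_SIZE;
                iov_iter_kvec(&iter, ITER_SOURCE, kvec, 1, PAGE_SIZE);

                /* Ask the iterator for the pages backing its first segment. */
                len = iov_iter_extract_pages(&iter, &pages, PAGE_SIZE, 1, 0,
                                             &offset0);
                KUNIT_EXPECT_GT(test, len, 0);
                KUNIT_EXPECT_EQ(test, offset0, offset_in_page(buf));

                kvfree(pages);          /* array allocated by the helper */
        }
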
    • iov_iter: Kunit tests for copying to/from an iterator · 2d71340f
      David Howells authored
      
      Add some kunit tests for copying to and from ITER_BVEC, ITER_KVEC and
      ITER_XARRAY type iterators.  ITER_UBUF and ITER_IOVEC aren't dealt with
      as they require userspace VM interaction.  ITER_DISCARD isn't dealt with
      either as that does nothing.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d71340f
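
      A similar hedged sketch for the copy direction (again not the in-tree
      test code; names and sizes are assumptions), exercising copy_to_iter()
      and copy_from_iter() over an ITER_KVEC iterator:

        static void iov_kvec_copy_sketch(struct kunit *test)
        {
                struct iov_iter iter;
                struct kvec kvec[1];
                u8 pattern[64], scratch[64];
                int i;

                for (i = 0; i < sizeof(pattern); i++)
                        pattern[i] = i;

                /* Copy into the buffer the iterator describes... */
                kvec[0].iov_base = scratch;
                kvec[0].iov_len  = sizeof(scratch);
                iov_iter_kvec(&iter, ITER_DEST, kvec, 1, sizeof(scratch));
                KUNIT_EXPECT_EQ(test, copy_to_iter(pattern, sizeof(pattern), &iter),
                                sizeof(pattern));
                KUNIT_EXPECT_EQ(test, memcmp(scratch, pattern, sizeof(pattern)), 0);

                /* ...and back out of it again. */
                memset(pattern, 0, sizeof(pattern));
                iov_iter_kvec(&iter, ITER_SOURCE, kvec, 1, sizeof(scratch));
                KUNIT_EXPECT_EQ(test, copy_from_iter(pattern, sizeof(pattern), &iter),
                                sizeof(pattern));
                KUNIT_EXPECT_EQ(test, memcmp(scratch, pattern, sizeof(pattern)), 0);
        }
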
    • iov_iter: Fix iov_iter_extract_pages() with zero-sized entries · f741bd71
      David Howells authored
      iov_iter_extract_pages() doesn't correctly handle skipping over initial
      zero-length entries in ITER_KVEC and ITER_BVEC-type iterators.
      
      The problem is that it accidentally reduces maxsize to 0 while skipping
      and thus runs to the end of the array and returns 0.
      
      Fix this by sticking the calculated size-to-copy in a new variable
      rather than back in maxsize.
      
      Fixes: 7d58fe73 ("iov_iter: Add a function to extract a page list from an iterator")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f741bd71
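
      A standalone illustration of the pattern the fix describes (hypothetical
      code, not the actual lib/iov_iter.c hunk): keep the clamped size in its
      own variable so skipping a zero-length first entry cannot zero the
      caller's maxsize and make the walk fall off the end of the array.

        struct seg { void *base; size_t len; };

        static size_t first_chunk(const struct seg *v, size_t n, size_t maxsize)
        {
                size_t i;

                for (i = 0; i < n; i++) {
                        /* Buggy shape: maxsize = min(maxsize, v[i].len);
                         * a zero-length entry turns maxsize into 0, so no
                         * later entry can ever satisfy the size check and
                         * the function ends up returning 0. */
                        size_t len = min(maxsize, v[i].len);

                        if (len)
                                return len;     /* fixed shape: maxsize intact */
                }
                return 0;
        }
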
  2. Sep 06, 2023
    • raid6: Add LoongArch SIMD recovery implementation · f2091321
      WANG Xuerui authored
      
      Similar to the syndrome calculation, the recovery algorithms also work
      on 64 bytes at a time to align with the L1 cache line size of current
      and future LoongArch cores (that we care about), which means
      unrolled-by-4 LSX and unrolled-by-2 LASX code.
      
      The assembly is originally based on the x86 SSSE3/AVX2 ports, but
      register allocation has been redone to take advantage of LSX/LASX's 32
      vector registers, and instruction sequence has been optimized to suit
      (e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
      cannot).
      
      Performance numbers measured by instrumenting the raid6test code, on a
      3A5000 system clocked at 2.5GHz:
      
      > lasx  2data: 354.987 MiB/s
      > lasx  datap: 350.430 MiB/s
      > lsx   2data: 340.026 MiB/s
      > lsx   datap: 337.318 MiB/s
      > intx1 2data: 164.280 MiB/s
      > intx1 datap: 187.966 MiB/s
      
      Because recovery algorithms are chosen solely based on priority and
      availability, lasx is marked as priority 2 and lsx priority 1. At least
      for the current generation of LoongArch micro-architectures, LASX should
      always be faster than LSX whenever supported, and have similar power
      consumption characteristics (because the only known LASX-capable uarch,
      the LA464, always computes the full 256-bit result for vector ops).
      
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
      f2091321
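
      For context, this is roughly how a recovery implementation plugs into the
      raid6 core (a sketch following the shape of lib/raid6/recov_avx2.c; the
      LoongArch function names below are assumptions, not copied from the
      patch):

        static int raid6_has_lsx(void);
        static void raid6_2data_recov_lsx(int disks, size_t bytes, int faila,
                                          int failb, void **ptrs);
        static void raid6_datap_recov_lsx(int disks, size_t bytes, int faila,
                                          void **ptrs);

        /* Picked purely by .priority among implementations whose .valid()
         * hook reports the CPU feature, hence LASX at 2 and LSX at 1. */
        const struct raid6_recov_calls raid6_recov_lsx = {
                .data2    = raid6_2data_recov_lsx,  /* two data disks lost */
                .datap    = raid6_datap_recov_lsx,  /* one data disk + P lost */
                .valid    = raid6_has_lsx,          /* runtime feature check */
                .name     = "lsx",
                .priority = 1,                      /* the LASX variant uses 2 */
        };
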
    • raid6: Add LoongArch SIMD syndrome calculation · 8f3f06df
      WANG Xuerui authored
      
      The algorithms work on 64 bytes at a time, which is the L1 cache line
      size of all current and future LoongArch cores (that we care about), as
      confirmed by Huacai. The code is based on the generic int.uc algorithm,
      unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
      not meaningfully improve the performance according to experiments.
      
      Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:
      
      > raid6: lasx     gen() 12726 MB/s
      > raid6: lsx      gen() 10001 MB/s
      > raid6: int64x8  gen()  2876 MB/s
      > raid6: int64x4  gen()  3867 MB/s
      > raid6: int64x2  gen()  2531 MB/s
      > raid6: int64x1  gen()  1945 MB/s
      
      Comparison of xor() speeds (from different boots but meaningful anyway):
      
      > lasx:    11226 MB/s
      > lsx:     6395 MB/s
      > int64x4: 2147 MB/s
      
      Performance as measured by raid6test:
      
      > raid6: lasx     gen() 25109 MB/s
      > raid6: lsx      gen() 13233 MB/s
      > raid6: int64x8  gen()  4164 MB/s
      > raid6: int64x4  gen()  6005 MB/s
      > raid6: int64x2  gen()  5781 MB/s
      > raid6: int64x1  gen()  4119 MB/s
      > raid6: using algorithm lasx gen() 25109 MB/s
      > raid6: .... xor() 14439 MB/s, rmw enabled
      
      Acked-by: Song Liu <song@kernel.org>
      Signed-off-by: WANG Xuerui <git@xen0n.name>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
      8f3f06df
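
      The arithmetic the unrolled LSX/LASX kernels vectorize 64 bytes at a
      time, written out as a plain scalar sketch (illustrative only, following
      the generic int.uc recurrence): P is the XOR of all data blocks and Q
      accumulates them under repeated multiplication by x in GF(2^8) with the
      0x11d polynomial.

        static u8 gf_mul2(u8 v)
        {
                return (v << 1) ^ ((v & 0x80) ? 0x1d : 0);
        }

        /* ndisks = number of data disks; p and q are the parity blocks. */
        static void gen_syndrome_scalar(int ndisks, size_t bytes,
                                        u8 **data, u8 *p, u8 *q)
        {
                size_t i;
                int z;

                for (i = 0; i < bytes; i++) {
                        u8 wp = data[ndisks - 1][i];
                        u8 wq = wp;

                        for (z = ndisks - 2; z >= 0; z--) {
                                wp ^= data[z][i];               /* P += D_z  */
                                wq  = gf_mul2(wq) ^ data[z][i]; /* Q = Q*g + D_z */
                        }
                        p[i] = wp;
                        q[i] = wq;
                }
        }
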
  3. Sep 05, 2023
  4. Aug 31, 2023
  5. Aug 28, 2023
  6. Aug 25, 2023
    • lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels · 382d4cd1
      Helge Deller authored
      
      On some architectures the gcc compiler translates the 64-bit
      __builtin_clzll() function into a call to the libgcc function __clzdi2(),
      which should take a 64-bit parameter on both 32- and 64-bit platforms.
      
      But in the current kernel code, the in-kernel __clzdi2() function is
      (wrongly) defined to operate on 32-bit parameters if BITS_PER_LONG ==
      32, so the return values on 32-bit kernels are in the range [0..31]
      instead of the expected range [0..63].
      
      This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() to
      take a 64-bit parameter on 32-bit kernels as well, thus it makes the
      functions identical for 32- and 64-bit kernels.
      
      This bug has gone unnoticed for over 10 years, since kernel 3.11, and
      here are some possible reasons for that:
      
       a) Some architectures have assembly instructions to count the bits,
          which are used instead of calling __clzdi2(); e.g. on x86 the bsr
          instruction and on ppc cntlz are used. On such architectures the
          wrong __clzdi2() implementation isn't used, so the bug has no
          effect and won't be noticed.
      
       b) Some architectures link to libgcc.a, and the in-kernel weak
          functions get replaced by the correct 64-bit variants from libgcc.a.
      
       c) __builtin_clzll() and __clzdi2() don't seem to be used in many
          places in the kernel, and most likely only in non-critical functions,
          e.g. when printing hex values via seq_put_hex_ll(). The wrong return
          value will still print the correct number, just with wrong
          formatting (e.g. with too many leading zeroes).
      
       d) 32-bit kernels aren't used that much any longer, so they are less
          tested.
      
      A trivial testcase to verify if the currently running 32-bit kernel is
      affected by the bug is to look at the output of /proc/self/maps:
      
      Here the kernel uses a correct implementation of __clzdi2():
      
        root@debian:~# cat /proc/self/maps
        00010000-00019000 r-xp 00000000 08:05 787324     /usr/bin/cat
        00019000-0001a000 rwxp 00009000 08:05 787324     /usr/bin/cat
        0001a000-0003b000 rwxp 00000000 00:00 0          [heap]
        f7551000-f770d000 r-xp 00000000 08:05 794765     /usr/lib/hppa-linux-gnu/libc.so.6
        ...
      
      and this kernel uses the broken implementation of __clzdi2():
      
        root@debian:~# cat /proc/self/maps
        0000000010000-0000000019000 r-xp 00000000 000000008:000000005 787324  /usr/bin/cat
        0000000019000-000000001a000 rwxp 000000009000 000000008:000000005 787324  /usr/bin/cat
        000000001a000-000000003b000 rwxp 00000000 00:00 0  [heap]
        00000000f73d1000-00000000f758d000 r-xp 00000000 000000008:000000005 794765  /usr/lib/hppa-linux-gnu/libc.so.6
        ...
      
      Signed-off-by: Helge Deller <deller@gmx.de>
      Fixes: 4df87bb7 ("lib: add weak clz/ctz functions")
      Cc: Chanho Min <chanho.min@lge.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: stable@vger.kernel.org # v3.11+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      382d4cd1
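
      A sketch of the post-fix behaviour (the in-tree lib/clz_ctz.c may differ
      in detail, e.g. in how it handles the two 32-bit halves), expressed with
      the kernel's fls64()/__ffs64() helpers: both functions now take a full
      64-bit argument regardless of BITS_PER_LONG.

        int __weak __clzdi2(u64 val)
        {
                /* leading zeros of a 64-bit value (val != 0, as for the
                 * GCC builtin) */
                return 64 - fls64(val);
        }

        int __weak __ctzdi2(u64 val)
        {
                /* trailing zeros of a 64-bit value (val != 0) */
                return __ffs64(val);
        }
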
    • pcpcntr: add group allocation/free · c439d5e8
      Mateusz Guzik authored
      
      Allocations and frees are globally serialized on the pcpu lock (and the
      CPU hotplug lock if enabled, which is the case on Debian).
      
      At least one frequent consumer allocates 4 back-to-back counters (and
      frees them in the same manner), exacerbating the problem.
      
      While this does not fully remedy scalability issues, it is a step
      towards that goal and provides immediate relief.
      
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Reviewed-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
      Link: https://lore.kernel.org/r/20230823050609.2228718-2-mjguzik@gmail.com
      [Dennis: reflowed a few lines]
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      c439d5e8
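
      A hedged usage sketch of the batched API this adds (names and signatures
      as understood from the series; check include/linux/percpu_counter.h in
      the tree this lands in, and the my_stats structure here is purely
      hypothetical):

        struct my_stats {
                struct percpu_counter ctr[4];   /* e.g. 4 back-to-back counters */
        };

        static int my_stats_init(struct my_stats *s)
        {
                /* One trip through the pcpu (and hotplug) locks instead of
                 * four separate percpu_counter_init() calls. */
                return percpu_counter_init_many(s->ctr, 0, GFP_KERNEL,
                                                ARRAY_SIZE(s->ctr));
        }

        static void my_stats_free(struct my_stats *s)
        {
                percpu_counter_destroy_many(s->ctr, ARRAY_SIZE(s->ctr));
        }
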
    • kunit: Fix checksum tests on big endian CPUs · b38460bc
      Christophe Leroy authored
      On powerpc64le checksum kunit tests work:
      
      [    2.011457][    T1]     KTAP version 1
      [    2.011662][    T1]     # Subtest: checksum
      [    2.011848][    T1]     1..3
      [    2.034710][    T1]     ok 1 test_csum_fixed_random_inputs
      [    2.079325][    T1]     ok 2 test_csum_all_carry_inputs
      [    2.127102][    T1]     ok 3 test_csum_no_carry_inputs
      [    2.127202][    T1] # checksum: pass:3 fail:0 skip:0 total:3
      [    2.127533][    T1] # Totals: pass:3 fail:0 skip:0 total:3
      [    2.127956][    T1] ok 1 checksum
      
      But on powerpc64 and powerpc32 they fail:
      
      [    1.859890][    T1]     KTAP version 1
      [    1.860041][    T1]     # Subtest: checksum
      [    1.860201][    T1]     1..3
      [    1.861927][   T58]     # test_csum_fixed_random_inputs: ASSERTION FAILED at lib/checksum_kunit.c:243
      [    1.861927][   T58]     Expected result == expec, but
      [    1.861927][   T58]         result == 54991 (0xd6cf)
      [    1.861927][   T58]         expec == 33316 (0x8224)
      [    1.863742][    T1]     not ok 1 test_csum_fixed_random_inputs
      [    1.864520][   T60]     # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
      [    1.864520][   T60]     Expected result == expec, but
      [    1.864520][   T60]         result == 255 (0xff)
      [    1.864520][   T60]         expec == 65280 (0xff00)
      [    1.868820][    T1]     not ok 2 test_csum_all_carry_inputs
      [    1.869977][   T62]     # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
      [    1.869977][   T62]     Expected result == expec, but
      [    1.869977][   T62]         result == 64515 (0xfc03)
      [    1.869977][   T62]         expec == 0 (0x0)
      [    1.872060][    T1]     not ok 3 test_csum_no_carry_inputs
      [    1.872102][    T1] # checksum: pass:0 fail:3 skip:0 total:3
      [    1.872458][    T1] # Totals: pass:0 fail:3 skip:0 total:3
      [    1.872791][    T1] not ok 3 checksum
      
      This is because all the expected values were calculated for x86, which
      is little endian. On big endian systems all precalculated 16-bit halves
      must be byte swapped.
      
      This is confirmed by the huge number of sparse errors reported when
      building with C=2.
      
      So fix all sparse errors and the tests will naturally work regardless
      of endianness.
      
      Fixes: 688eb819 ("x86/csum: Improve performance of `csum_partial`")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b38460bc
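
      A hedged illustration of the endianness issue (not the exact
      lib/checksum_kunit.c change; the helper name is made up): the expected
      values were generated on little-endian x86, so on a big-endian CPU each
      precalculated 16-bit half has to be byte swapped before use.

        static u16 expect_on_this_cpu(u16 value_as_computed_on_x86)
        {
        #ifdef __BIG_ENDIAN
                return swab16(value_as_computed_on_x86);
        #else
                return value_as_computed_on_x86;
        #endif
        }
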
  7. Aug 24, 2023
    • maple_tree: clean up mas_wr_append() · 432af5c9
      Liam R. Howlett authored
      Avoid setting the variables until necessary, and actually use the
      variables where applicable.  Introducing a variable for the slots array
      avoids spanning multiple lines.
      
      Add the missing argument to the documentation.
      
      Use the node type when setting the metadata instead of blindly assuming
      the type.
      
      Finally, add a trace point to the function for a successful store.
      
      Link: https://lkml.kernel.org/r/20230819004356.1454718-3-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      432af5c9
    • minmax: add in_range() macro · f9bff0e3
      Matthew Wilcox (Oracle) authored
      Patch series "New page table range API", v6.
      
      This patchset changes the API used by the MM to set up page table entries.
      The four APIs are:
      
          set_ptes(mm, addr, ptep, pte, nr)
          update_mmu_cache_range(vma, addr, ptep, nr)
          flush_dcache_folio(folio) 
          flush_icache_pages(vma, page, nr)
      
      flush_dcache_folio() isn't technically new, but no architecture
      implemented it, so I've done that for them.  The old APIs remain around
      but are mostly implemented by calling the new interfaces.
      
      The new APIs are based around setting up N page table entries at once. 
      The N entries belong to the same PMD, the same folio and the same VMA, so
      ptep++ is a legitimate operation, and locking is taken care of for you. 
      Some architectures can do a better job of it than just a loop, but I have
      hesitated to make too deep a change to architectures I don't understand
      well.
      
      One thing I have changed in every architecture is that PG_arch_1 is now a
      per-folio bit instead of a per-page bit when used for dcache clean/dirty
      tracking.  This was something that would have to happen eventually, and it
      makes sense to do it now rather than iterate over every page involved in a
      cache flush and figure out if it needs to happen.
      
      The point of all this is better performance, and Fengwei Yin has measured
      improvement on x86.  I suspect you'll see improvement on your architecture
      too.  Try the new will-it-scale test mentioned here:
      https://lore.kernel.org/linux-mm/20230206140639.538867-5-fengwei.yin@intel.com/
      You'll need to run it on an XFS filesystem and have
      CONFIG_TRANSPARENT_HUGEPAGE set.
      
      This patchset is the basis for much of the anonymous large folio work
      being done by Ryan, so it's received quite a lot of testing over the last
      few months.
      
      
      This patch (of 38):
      
      Determine if a value lies within a range more efficiently (subtraction +
      comparison vs two comparisons and an AND).  It also has useful (under some
      circumstances) behaviour if the range exceeds the maximum value of the
      type.  Convert all the conflicting definitions of in_range() within the
      kernel; some can use the generic definition while others need their own
      definition.
      
      Link: https://lkml.kernel.org/r/20230802151406.3735276-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20230802151406.3735276-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f9bff0e3
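
      The trick behind in_range(), as a simplified sketch (the real macro in
      include/linux/minmax.h also widens its operands based on their sizes):
      with unsigned arithmetic, val - start wraps around when val < start, so
      one comparison against len replaces "val >= start && val < start + len"
      and keeps working even when start + len would overflow the type.

        static inline bool in_range64(u64 val, u64 start, u64 len)
        {
                return val - start < len;
        }
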
    • maple_tree: disable mas_wr_append() when other readers are possible · cfeb6ae8
      Liam R. Howlett authored
      The current implementation of append may cause duplicate data and/or
      incorrect ranges to be returned to a reader during an update.  Although
      this has not been reported or seen, disable the append write operation
      while the tree is in rcu mode out of an abundance of caution.
      
      During the analysis of mas_next_slot(), the following race was
      artificially created by separating the writer and reader code:
      
      Writer:                                 reader:
      mas_wr_append
          set end pivot
          updates end metadata
          Detects write to last slot
          last slot write is to start of slot
          store current contents in slot
          overwrite old end pivot
                                              mas_next_slot():
                                                      read end metadata
                                                      read old end pivot
                                                      return with incorrect range
          store new value
      
      Alternatively:
      
      Writer:                                 reader:
      mas_wr_append
          set end pivot
          updates end metadata
          Detects write to last slot
          last slot write is to end of slot
          store value
                                              mas_next_slot():
                                                      read end metadata
                                                      read old end pivot
                                                      read new end pivot
                                                      return with incorrect range
          set old end pivot
      
      There may be other accesses that are not safe since we are now updating
      both metadata and pointers, so disabling append if there could be rcu
      readers is the safest action.
      
      Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cfeb6ae8
  8. Aug 21, 2023
  9. Aug 19, 2023
  10. Aug 18, 2023