  1. Jun 07, 2024
  2. Jun 05, 2024
  3. Jun 04, 2024
  4. Jun 01, 2024
  5. May 30, 2024
    • block: Fix zone write plugging handling of devices with a runt zone · 29459c3e
      Damien Le Moal authored
      A zoned device may have a last sequential write required zone that is
      smaller than other zones. However, all tests to check if a zone write
      plug write offset exceeds the zone capacity use the same capacity
      value stored in the gendisk zone_capacity field. This is incorrect for a
      zoned device with a last runt (smaller) zone.
      
      Add the new field last_zone_capacity to struct gendisk to store the
      capacity of the last zone of the device. blk_revalidate_seq_zone() and
      blk_revalidate_conv_zone() are both modified to get this value when
      disk_zone_is_last() returns true. Similarly to zone_capacity, the value
      is first stored using the last_zone_capacity field of struct
      blk_revalidate_zone_args. Once zone revalidation of all zones is done,
      this is used to set the gendisk last_zone_capacity field.
      
      The checks to determine if a zone is full or if a sector offset in a
      zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
      disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
      and blk_zone_wplug_prepare_bio() are modified to use the new helper
      functions disk_zone_is_full() and disk_zone_wplug_is_full().
      disk_zone_is_full() uses the zone index to determine if the zone being
      tested is the last one of the disk and uses either the disk
      zone_capacity or last_zone_capacity accordingly.
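      The capacity selection described above can be sketched as a minimal
      userspace model. The struct and function names below (model_*) are
      invented for illustration and greatly simplified from the kernel's
      gendisk and zone write plug code; only the "pick the right capacity
      for the runt last zone" logic is modeled:

```c
#include <stdbool.h>

/* Simplified stand-in for the relevant gendisk fields (assumption:
 * zones are identified by index; the last zone may be a smaller runt). */
struct model_disk {
        unsigned int nr_zones;
        unsigned long zone_capacity;      /* capacity of all zones but the last */
        unsigned long last_zone_capacity; /* capacity of the (possibly runt) last zone */
};

/* Model of disk_zone_is_last(): is this zone the final one of the disk? */
static bool model_zone_is_last(const struct model_disk *d, unsigned int zno)
{
        return zno == d->nr_zones - 1;
}

/* Model of disk_zone_is_full(): a zone is full when its write offset
 * reaches its own capacity, which differs for the last (runt) zone. */
static bool model_zone_is_full(const struct model_disk *d, unsigned int zno,
                               unsigned long wp_ofst)
{
        unsigned long cap = model_zone_is_last(d, zno) ?
                            d->last_zone_capacity : d->zone_capacity;

        return wp_ofst >= cap;
}
```

      With a 64-sector runt last zone on a disk of 256-sector zones, a write
      offset of 64 fills the last zone but not any other zone, which is
      exactly the case the single zone_capacity field got wrong.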
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Niklas Cassel <cassel@kernel.org>
      Link: https://lore.kernel.org/r/20240530054035.491497-4-dlemoal@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • netdev: add qstat for csum complete · 13c7c941
      Jakub Kicinski authored
      Recent commit 0cfe71f4 ("netdev: add queue stats") added
      a lot of useful stats, but only those immediately needed by virtio.
      Presumably virtio does not support CHECKSUM_COMPLETE,
      so the statistic for that form of checksumming wasn't included.
      Other drivers will definitely need it; in fact, we expect it
      to be needed in net-next soon (mlx5). So let's add the definition
      of the counter for CHECKSUM_COMPLETE to the uAPI in net already,
      so that the counters are in a more natural order (all subsequent
      counters have not been present in any released kernel yet).
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Joe Damato <jdamato@fastly.com>
      Fixes: 0cfe71f4 ("netdev: add queue stats")
      Link: https://lore.kernel.org/r/20240529163547.3693194-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  6. May 29, 2024
  7. May 28, 2024
  8. May 27, 2024
  9. May 26, 2024
  10. May 25, 2024
    • netkit: Fix pkt_type override upon netkit pass verdict · 3998d184
      Daniel Borkmann authored
      When running Cilium connectivity test suite with netkit in L2 mode, we
      found that compared to tcx a few tests were failing which pushed traffic
      into an L7 proxy sitting in host namespace. The problem in particular is
      around the invocation of eth_type_trans() in netkit.
      
      In case of tcx, this is run before the tcx ingress is triggered inside
      host namespace and thus if the BPF program uses the bpf_skb_change_type()
      helper the newly set type is retained. However, in case of netkit, the
      late eth_type_trans() invocation overrides the earlier decision from the
      BPF program which eventually leads to the test failure.
      
      Instead of eth_type_trans(), split out the relevant parts, namely the
      reset of the mac header and the call to eth_skb_pkt_type(), and run them
      before the BPF program in order to have the same behavior as with tcx.
      Also factor out a small helper called eth_skb_pull_mac(), which is run
      when the skb is passed up the stack and the mac header must be pulled.
      With this all ...
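      The pre-BPF classification step can be sketched as a small userspace
      model. This is an assumption-laden simplification of what
      eth_skb_pkt_type() does (classify pkt_type from the destination MAC),
      not the kernel code itself; the model_* names are invented:

```c
#include <string.h>

/* Simplified pkt_type values, mirroring the kernel's PACKET_* meanings. */
enum model_pkt_type {
        MODEL_PACKET_HOST,       /* addressed to this device */
        MODEL_PACKET_BROADCAST,  /* all-ones destination MAC */
        MODEL_PACKET_MULTICAST,  /* group bit set, not broadcast */
        MODEL_PACKET_OTHERHOST,  /* unicast, but not our MAC */
};

/* Model of the classification run before the BPF program: deciding
 * pkt_type early means a later bpf_skb_change_type() verdict is not
 * overwritten by a full eth_type_trans() after the program runs. */
static enum model_pkt_type model_pkt_type(const unsigned char dst[6],
                                          const unsigned char dev_mac[6])
{
        static const unsigned char bcast[6] = { 0xff, 0xff, 0xff,
                                                0xff, 0xff, 0xff };

        if (dst[0] & 1)  /* group (multicast/broadcast) bit */
                return memcmp(dst, bcast, 6) == 0 ? MODEL_PACKET_BROADCAST
                                                  : MODEL_PACKET_MULTICAST;
        return memcmp(dst, dev_mac, 6) == 0 ? MODEL_PACKET_HOST
                                            : MODEL_PACKET_OTHERHOST;
}
```

      The point of the fix is ordering: running this classification before
      the BPF program, rather than a full eth_type_trans() after it, lets a
      bpf_skb_change_type() verdict survive, as it does with tcx.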
  11. May 24, 2024
  12. May 23, 2024
    • mseal: add mseal syscall · 8be7258a
      Jeff Xu authored
      The new mseal() is a syscall on 64-bit CPUs, with the following signature:

      int mseal(void *addr, size_t len, unsigned long flags)
      addr/len: memory range.
      flags: reserved.

      mseal() blocks the following operations for the given memory range.
      
      1> Unmapping, moving to another location, and shrinking the size,
         via munmap() and mremap(): these can leave an empty space that can
         then be replaced with a VMA with a new set of attributes.
      
      2> Moving or expanding a different VMA into the current location,
         via mremap().
      
      3> Modifying a VMA via mmap(MAP_FIXED).
      
      4> Size expansion, via mremap(), does not appear to pose any specific
         risks to sealed VMAs. It is included anyway because the use case is
         unclear. In any case, users can rely on merging to expand a sealed VMA.
      
      5> mprotect() and pkey_mprotect().
      
      6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for
         anonymous memory, when users don't have write permission to the
         memory. Those behaviors can alter region contents by...
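      A hedged usage sketch of the syscall follows. It assumes the
      x86-64/arm64 syscall number 462 when the libc headers don't define
      __NR_mseal, and it deliberately treats a failing mseal() (e.g. on a
      kernel without the syscall) as "nothing to verify" rather than an
      error; the only behavior it checks is that a successfully sealed
      mapping rejects mprotect() with EPERM:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462  /* x86-64/arm64 number; assumption for older headers */
#endif

/* Thin wrapper; flags is reserved and must currently be 0. */
static long try_mseal(void *addr, size_t len, unsigned long flags)
{
        return syscall(__NR_mseal, addr, len, flags);
}

/* Returns 0 when either mseal() is unavailable (skipped) or the sealed
 * mapping correctly rejects mprotect() with EPERM; non-zero otherwise. */
static int mseal_demo(void)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        if (try_mseal(p, page, 0) != 0) {
                /* Kernel without mseal() or sandboxed: nothing to verify. */
                munmap(p, page);
                return 0;
        }

        /* Sealed: mprotect() on the range must now fail with EPERM. */
        if (mprotect(p, page, PROT_READ) != -1 || errno != EPERM)
                return 1;
        return 0;  /* the sealed VMA also cannot be munmap()ed now */
}
```

      Note that once sealed, the mapping cannot be unsealed or unmapped for
      the lifetime of the process, which is the intended hardening property.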
    • mseal: wire up mseal syscall · ff388fe5
      Jeff Xu authored
      Patch series "Introduce mseal", v10.
      
      This patchset proposes a new mseal() syscall for the Linux kernel.
      
      In a nutshell, mseal() protects the VMAs of a given virtual memory range
      against modifications, such as changes to their permission bits.
      
      Modern CPUs support memory permissions, such as the read/write (RW) and
      no-execute (NX) bits.  Linux has supported NX since the release of kernel
      version 2.6.8 in August 2004 [1].  The memory permission feature improves
      the security stance on memory corruption bugs, as an attacker cannot
      simply write to arbitrary memory and point the code to it.  The memory
      must be marked with the X bit, or else an exception will occur. 
      Internally, the kernel maintains the memory permissions in a data
      structure called VMA (vm_area_struct).  mseal() additionally protects the
      VMA itself against modifications of the selected seal type.
      
      Memory sealing is useful to mitigate memory corruption issues where a
      corrupted pointer is passed to a memory management...
    • i2c: Remove I2C_CLASS_SPD · e61bcf42
      Heiner Kallweit authored
      
      Remove this class now that all of its users are gone.
      
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
  13. May 22, 2024
    • tracing/treewide: Remove second parameter of __assign_str() · 2c92ca84
      Steven Rostedt (Google) authored
      With the rework of how __string() handles dynamic strings, where it
      saves off the source string in a field of the helper structure[1], the
      value to assign to the trace event field is already stored in the helper
      structure and does not need to be passed in again.
      
      This means that with:
      
        __string(field, mystring)
      
      which used to be assigned with __assign_str(field, mystring), no longer
      needs the second parameter, as it is unused. With this, __assign_str()
      now takes only a single parameter.
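      Why the second parameter became redundant can be shown with a toy
      userspace model. This is not the real tracing macro machinery; the
      struct and model_* names below are invented, and the point is only
      that __string() already records the source pointer, so
      __assign_str() can copy from it without being told the source again:

```c
#include <string.h>

/* Toy stand-in for one entry of the helper structure the macros build. */
struct model_str_entry {
        const char *src;  /* saved when __string(field, mystring) runs */
        char dst[64];     /* the trace event field itself */
};

/* Model of __string(field, mystring): remember the source string. */
static void model_string(struct model_str_entry *e, const char *src)
{
        e->src = src;
}

/* Model of the new one-argument __assign_str(field): the source comes
 * from the pointer saved at __string() time, not from a parameter. */
static void model_assign_str(struct model_str_entry *e)
{
        strncpy(e->dst, e->src, sizeof(e->dst) - 1);
        e->dst[sizeof(e->dst) - 1] = '\0';
}
```

      In the old scheme the caller passed mystring a second time to
      __assign_str(), which duplicated information the helper already held.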
      
      There are over 700 users of __assign_str(), and because coccinelle does
      not handle the TRACE_EVENT() macro, I ended up using the following sed
      script:
      
        git grep -l __assign_str | while read a ; do
            sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
            mv /tmp/test-file $a;
        done
      
      I then searched for __assign_str() that did not end with ';' as those
      were multi line assignments that the sed script above would fail to catch.
      
      Note, the same updates will need to be done for:
      
        __assign_str_len()
        __assign_rel_str()
        __assign_rel_str_len()
      
      I tested this with both an allmodconfig and an allyesconfig (build only for both).
      
      [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Jani Nikula <jani.nikula@intel.com>
      Acked-by: Christian König <christian.koenig@amd.com> # for the amdgpu parts
      Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> # for
      Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs
      Tested-by: Guenter Roeck <linux@roeck-us.net>
    • ftrace: riscv: move from REGS to ARGS · 7caa9765
      Puranjay Mohan authored
      This commit replaces riscv's support for FTRACE_WITH_REGS with support
      for FTRACE_WITH_ARGS. This is required for the ongoing effort to stop
      relying on stop_machine() for RISCV's implementation of ftrace.
      
      The main relevant benefit that this change will bring for the above
      use-case is that now we don't have separate ftrace_caller and
      ftrace_regs_caller trampolines. This will allow the callsite to call
      ftrace_caller by modifying a single instruction. Now the callsite can
      do something similar to:
      
      When not tracing:            |             When tracing:
      
      func:                                      func:
        auipc t0, ftrace_caller_top                auipc t0, ftrace_caller_top
        nop  <=========<Enable/Disable>=========>  jalr  t0, ftrace_caller_bottom
        [...]                                      [...]
      
      The above assumes that we are dropping the support of calling a direct
      trampoline from the callsite. We need to drop this as the callsite can't
      change the target address to call, it can only enable/disable a call to
      a preset target (ftrace_caller in the above diagram). We can later optimize
      this by calling an intermediate dispatcher trampoline before ftrace_caller.
      
      Currently, ftrace_regs_caller saves all CPU registers in the format of
      struct pt_regs and allows the tracer to modify them. We don't need to
      save all of the CPU registers because at function entry only a subset of
      pt_regs is live:
      
      |----------+----------+---------------------------------------------|
      | Register | ABI Name | Description                                 |
      |----------+----------+---------------------------------------------|
      | x1       | ra       | Return address for traced function          |
      | x2       | sp       | Stack pointer                               |
      | x5       | t0       | Return address for ftrace_caller trampoline |
      | x8       | s0/fp    | Frame pointer                               |
      | x10-11   | a0-1     | Function arguments/return values            |
      | x12-17   | a2-7     | Function arguments                          |
      |----------+----------+---------------------------------------------|
      
      See RISCV calling convention[1] for the above table.
      
      Saving just the live registers decreases the amount of stack space
      required from 288 bytes to 112 bytes.
      
      Basic testing was done with this on the VisionFive 2 development board.
      
      Note:
        - Moving from REGS to ARGS will mean that RISCV will stop supporting
          KPROBES_ON_FTRACE as it requires full pt_regs to be saved.
        - KPROBES_ON_FTRACE will be supplanted by FPROBES see [2].
      
      [1] https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
      [2] https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/

      Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
      Tested-by: Björn Töpel <bjorn@rivosinc.com>
      Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
      Link: https://lore.kernel.org/r/20240405142453.4187-1-puranjay@kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
    • clang: work around asm input constraint problems · dbaaabd6
      Linus Torvalds authored
      Work around clang problems with asm constraints that have multiple
      possibilities, particularly "g" and "rm".
      
      Clang seems to turn inputs like that into the most generic form, which
      is the memory input. To make matters worse, clang won't even use a
      possible original memory location; it will spill the value to the stack
      and use the stack slot for the asm input.
      
      See
      
        https://github.com/llvm/llvm-project/issues/20571#issuecomment-980933442
      
      for some explanation of why clang has this strange behavior, but the end
      result is that "g" and "rm" really end up generating horrid code.
      
      Link: https://github.com/llvm/llvm-project/issues/20571

      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • virtio-mem: support suspend+resume · e4544c55
      David Hildenbrand authored
      
      With virtio-mem, primarily hibernation is problematic: as the machine shuts
      down, the virtio-mem device loses its state. Powering the machine back up
      is like losing a bunch of DIMMs. While there would be ways to add limited
      support, suspend+resume is more commonly used for VMs and "easier" to
      support cleanly.
      
      s2idle can be supported without any device dependencies. Similarly, one
      would expect suspend-to-ram (i.e., S3) to work out of the box. However,
      QEMU currently unplugs all device memory when resuming the VM, using a
      cold reset on the "wakeup" path. In order to support S3, we need a feature
      flag for the device to tell us if memory remains plugged when waking up. In
      the future, QEMU will implement this feature.
      
      So let's always support s2idle and support S3 with plugged memory only if
      the device indicates support. Block hibernation early using the PM
      notifier.
      
      Trying to hibernate now fails early:
      	# echo disk > /sys/power/state
      	[   26.455369] PM: hibernation: hibernation entry
      	[   26.458271] virtio_mem virtio0: hibernation is not supported.
      	[   26.462498] PM: hibernation: hibernation exit
      	-bash: echo: write error: Operation not permitted
      
      s2idle works even without the new feature bit:
      	# echo s2idle > /sys/power/mem_sleep
      	# echo mem > /sys/power/state
      	[   52.083725] PM: suspend entry (s2idle)
      	[   52.095950] Filesystems sync: 0.010 seconds
      	[   52.101493] Freezing user space processes
      	[   52.104213] Freezing user space processes completed (elapsed 0.001 seconds)
      	[   52.106520] OOM killer disabled.
      	[   52.107655] Freezing remaining freezable tasks
      	[   52.110880] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
      	[   52.113296] printk: Suspending console(s) (use no_console_suspend to debug)
      
      S3 does not work without the feature bit when memory is plugged:
      	# echo deep > /sys/power/mem_sleep
      	# echo mem > /sys/power/state
      	[   32.788281] PM: suspend entry (deep)
      	[   32.816630] Filesystems sync: 0.027 seconds
      	[   32.820029] Freezing user space processes
      	[   32.823870] Freezing user space processes completed (elapsed 0.001 seconds)
      	[   32.827756] OOM killer disabled.
      	[   32.829608] Freezing remaining freezable tasks
      	[   32.833842] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
      	[   32.837953] printk: Suspending console(s) (use no_console_suspend to debug)
      	[   32.916172] virtio_mem virtio0: suspend+resume with plugged memory is not supported
      	[   32.916181] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1
      	[   32.916197] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1
      	[   32.916210] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1
      
      But S3 works with the new feature bit when memory is plugged (patched
      QEMU):
      	# echo deep > /sys/power/mem_sleep
      	# echo mem > /sys/power/state
      	[   33.983694] PM: suspend entry (deep)
      	[   34.009828] Filesystems sync: 0.024 seconds
      	[   34.013589] Freezing user space processes
      	[   34.016722] Freezing user space processes completed (elapsed 0.001 seconds)
      	[   34.019092] OOM killer disabled.
      	[   34.020291] Freezing remaining freezable tasks
      	[   34.023549] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
      	[   34.026090] printk: Suspending console(s) (use no_console_suspend to debug)
      
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Message-Id: <20240318120645.105664-1-david@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vhost_task: Handle SIGKILL by flushing work and exiting · db5247d9
      Mike Christie authored
      
      Instead of lingering until the device is closed, this has us handle
      SIGKILL by:
      
      1. marking the worker as killed so we no longer try to use it with
         new virtqueues and new flush operations.
      2. setting the virtqueue to worker mapping so no new works are queued.
      3. running all the exiting works.
      
      Suggested-by: Edward Adam Davis <eadavis@qq.com>
      Reported-and-tested-by: syzbot+98edc2df894917b3431f@syzkaller.appspotmail.com
      Message-Id: <tencent_546DA49414E876EEBECF2C78D26D242EE50A@qq.com>
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Message-Id: <20240316004707.45557-9-michael.christie@oracle.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL · 93022482
      Tony Luck authored
      Code in v6.9 arch/x86/kernel/smpboot.c was changed by commit
      4db64279 ("x86/cpu: Switch to new Intel CPU model defines") from:
      
        static const struct x86_cpu_id intel_cod_cpu[] = {
                X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, 0),       /* COD */
                X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, 0),     /* COD */
                X86_MATCH_INTEL_FAM6_MODEL(ANY, 1),             /* SNC */	<--- 443
                {}
        };
      
        static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
        {
                const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
      
      to:
      
        static const struct x86_cpu_id intel_cod_cpu[] = {
                 X86_MATCH_VFM(INTEL_HASWELL_X,   0),    /* COD */
                 X86_MATCH_VFM(INTEL_BROADWELL_X, 0),    /* COD */
                 X86_MATCH_VFM(INTEL_ANY,         1),    /* SNC */
                 {}
         };
      
        static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
        {
                const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
      
      On an Intel CPU with SNC enabled this code previously matched the rule on line
      443 to avoid printing messages about insane cache configuration.  The new code
      did not match any rules.
      
      Expanding the macros for the intel_cod_cpu[] array shows that the old
      code is equivalent to:
      
        static const struct x86_cpu_id intel_cod_cpu[] = {
        [0] = { .vendor = 0, .family = 6, .model = 0x3F, .steppings = 0, .feature = 0, .driver_data = 0 },
        [1] = { .vendor = 0, .family = 6, .model = 0x4F, .steppings = 0, .feature = 0, .driver_data = 0 },
        [2] = { .vendor = 0, .family = 6, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 1 },
        [3] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 0 }
        }
      
      while the new code expands to:
      
        static const struct x86_cpu_id intel_cod_cpu[] = {
        [0] = { .vendor = 0, .family = 6, .model = 0x3F, .steppings = 0, .feature = 0, .driver_data = 0 },
        [1] = { .vendor = 0, .family = 6, .model = 0x4F, .steppings = 0, .feature = 0, .driver_data = 0 },
        [2] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 1 },
        [3] = { .vendor = 0, .family = 0, .model = 0x00, .steppings = 0, .feature = 0, .driver_data = 0 }
        }
      
      Looking at the code for x86_match_cpu():
      
        const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
        {
                 const struct x86_cpu_id *m;
                 struct cpuinfo_x86 *c = &boot_cpu_data;
      
                 for (m = match;
                      m->vendor | m->family | m->model | m->steppings | m->feature;
                      m++) {
             		...
                 }
                 return NULL;
      
      it is clear that there was no match because the ANY entry in the table (array
      index 2) is now the loop termination condition (all of vendor, family, model,
      steppings, and feature are zero).
      
      So this code was working before because the "ANY" check was looking for
      any Intel CPU in family 6. But it fails now because the family is a
      wildcard. So the root cause is that x86_match_cpu() has never been able
      to match on a rule with just X86_VENDOR_INTEL and all other fields set
      to wildcards.
      
      Add a new flags field to struct x86_cpu_id that has a bit set to indicate that
      this entry in the array is valid. Update X86_MATCH*() macros to set that bit.
      Change the end-marker check in x86_match_cpu() to just check the flags field
      for this bit.
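      The shape of the fix can be sketched with a minimal userspace model.
      The fields and model_* names below are invented simplifications of
      struct x86_cpu_id and x86_match_cpu(), keeping only the part that
      matters: an explicit valid flag terminates the table, so an entry
      matching any Intel CPU (vendor == 0, every other field a wildcard) no
      longer looks identical to the all-zero end marker:

```c
#include <stddef.h>

#define MODEL_VALID 1u  /* stand-in for the new "this entry is valid" bit */

/* Trimmed-down stand-in for struct x86_cpu_id. */
struct model_cpu_id {
        unsigned int vendor;     /* matched exactly; Intel is 0 */
        unsigned int family;     /* 0 means wildcard */
        unsigned int model_num;  /* 0 means wildcard */
        unsigned int flags;      /* MODEL_VALID marks a live entry */
        unsigned long driver_data;
};

struct model_cpu {
        unsigned int vendor, family, model_num;
};

/* Model of the fixed loop: iterate while the flags bit is set, instead
 * of while (vendor | family | model | ...) is nonzero. */
static const struct model_cpu_id *
model_match_cpu(const struct model_cpu_id *m, const struct model_cpu *c)
{
        for (; m->flags & MODEL_VALID; m++) {
                if (m->family && m->family != c->family)
                        continue;
                if (m->model_num && m->model_num != c->model_num)
                        continue;
                if (m->vendor != c->vendor)
                        continue;
                return m;
        }
        return NULL;
}
```

      With the old all-fields-zero termination test, the "any Intel" entry
      would itself have stopped the loop; with the flag, only the all-zero
      terminator does.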
      
      Backporter notes: the commit in Fixes is really the one that is broken:
      you can't have m->vendor as part of the loop termination conditional in
      x86_match_cpu(), because it can happen, as it has happened above, that
      the whole conditional is 0 even though vendor == 0 is a valid case
      (X86_VENDOR_INTEL is 0).
      
      However, the only case where the above happens is the SNC check added by
      4db64279 ("x86/cpu: Switch to new Intel CPU model defines"), so you only
      need this fix if you have backported that other commit.

      Fixes: 644e9cbb ("Add driver auto probing for x86 features v4")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: <stable+noautosel@kernel.org> # see above
      Link: https://lore.kernel.org/r/20240517144312.GBZkdtAOuJZCvxhFbJ@fat_crate.local
  14. May 21, 2024