  1. Jul 12, 2024
  2. Jul 11, 2024
  3. Jul 09, 2024
    • bpf: relax zero fixed offset constraint on KF_TRUSTED_ARGS/KF_RCU · 605c9699
      Matt Bobrowski authored
      Currently, BPF kfuncs which accept trusted pointer arguments,
      i.e. those flagged as KF_TRUSTED_ARGS, KF_RCU, or KF_RELEASE, all
      require an original/unmodified trusted pointer argument to be supplied
      to them. Original/unmodified here means that the backing register
      holding the trusted pointer argument that is to be supplied to the BPF
      kfunc must have its fixed offset set to zero, or else the BPF verifier
      will outright reject the BPF program load. However, this zero fixed
      offset constraint that the BPF verifier currently enforces on BPF
      kfuncs flagged to accept KF_TRUSTED_ARGS or KF_RCU trusted pointer
      arguments is rather unnecessary, and can limit their usability in
      practice. Specifically, it completely rules out constructing a derived
      trusted pointer from an original trusted pointer. Put simply, a
      derived pointer is a pointer which points to one of the nested member
      fields of the object pointed to by the original trusted pointer.
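
      As a rough illustration of what the relaxed constraint permits, the
      sketch below passes a derived pointer (&file->f_path) to a purely
      hypothetical KF_TRUSTED_ARGS kfunc; the kfunc name is made up for the
      example and is not introduced by this patch:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* Hypothetical kfunc taking a trusted 'struct path *' argument. */
        extern void bpf_path_example_inspect(struct path *path) __ksym;

        SEC("lsm.s/file_open")
        int BPF_PROG(derived_ptr_example, struct file *file)
        {
                /* 'file' is a trusted pointer; '&file->f_path' is a derived
                 * trusted pointer with a non-zero fixed offset, which the
                 * verifier previously rejected for KF_TRUSTED_ARGS kfuncs.
                 */
                bpf_path_example_inspect(&file->f_path);
                return 0;
        }

        char _license[] SEC("license") = "GPL";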
      
      This patch relaxes the zero fixed offset constraint that is enforced
      upon BPF kfuncs which specifically accept KF_TRUSTED_ARGS or KF_RCU
      arguments. Although the zero fixed offset constraint technically also
      applies to BPF kfuncs accepting KF_RELEASE arguments, relaxing this
      constraint for such BPF kfuncs has subtle and unwanted
      side-effects. This was discovered by experimenting a little further
      with an initial version of this patch series [0]. The primary issue
      with relaxing the zero fixed offset constraint on BPF kfuncs accepting
      KF_RELEASE arguments is that it would open up the opportunity for
      BPF programs to supply both trusted pointers and derived trusted
      pointers to them. For KF_RELEASE BPF kfuncs specifically, this could
      be problematic as resources associated with the backing pointer could
      be released by the backing BPF kfunc and cause instabilities for the
      rest of the kernel.
      
      With this new fixed offset semantic in-place for BPF kfuncs accepting
      KF_TRUSTED_ARGS and KF_RCU arguments, we now have more flexibility
      when it comes to the BPF kfuncs that we're able to introduce moving
      forward.
      
      Early discussions covering the possibility of relaxing the zero fixed
      offset constraint can be found using the link below. This will provide
      more context on where all this has stemmed from [1].
      
      Notably, pre-existing tests have been updated such that they provide
      coverage for the updated zero fixed offset
      functionality. Specifically, the nested offset test was converted from
      a negative to positive test as it was already designed to assert zero
      fixed offset semantics of a KF_TRUSTED_ARGS BPF kfunc.
      
      [0] https://lore.kernel.org/bpf/ZnA9ndnXKtHOuYMe@google.com/
      [1] https://lore.kernel.org/bpf/ZhkbrM55MKQ0KeIV@google.com/
      
      
      
      Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20240709210939.1544011-1-mattbobrowski@google.com
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. Jul 08, 2024
  5. Jul 03, 2024
  6. Jul 02, 2024
  7. Jul 01, 2024
  8. Jun 27, 2024
    • kallsyms: rework symbol lookup return codes · 7e1f4eb9
      Arnd Bergmann authored
      Building with W=1 in some configurations produces a false positive
      warning for kallsyms:
      
      kernel/kallsyms.c: In function '__sprint_symbol.isra':
      kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict]
        503 |                 strcpy(buffer, name);
            |                 ^~~~~~~~~~~~~~~~~~~~
      
      This originally showed up while building with -O3, but later started
      happening in other configurations as well, depending on inlining
      decisions. The underlying issue is that the local 'name' variable is
      always initialized to be the same as 'buffer' in the called functions
      that fill the buffer, which gcc notices while inlining, even though
      the address check means the copy is always skipped in that case.
      
      The calling conventions here are rather unusual, as all of the internal
      lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
      ftrace_func_address_lookup, module_address_lookup and
      kallsyms_lookup_buildid) already use the provided buffer and either return
      the address of that buffer to indicate success, or NULL for failure,
      but the callers are written to also expect an arbitrary other buffer
      to be returned.
      
      Rework the calling conventions to return the length of the filled buffer
      instead of its address, which is simpler and easier to follow as well
      as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
      unchanged, since that is called from 16 different functions and
      adapting this would be a much bigger change.
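
      To make the new convention concrete, here is a small user-space model
      of it (the helper name and body are invented for illustration and are
      not the kernel functions): the lookup fills the caller's buffer and
      reports the length of what it wrote, rather than returning a pointer
      to some buffer.

        #include <stdio.h>

        #define NAME_LEN 128

        /* Hypothetical lookup helper following the reworked convention:
         * fill 'buffer' and return its length, or 0 when nothing was found.
         */
        static int example_symbol_lookup(unsigned long addr, char *buffer)
        {
                if (!addr)
                        return 0;                         /* not found */
                return snprintf(buffer, NAME_LEN, "sym_%lx", addr);
        }

        int main(void)
        {
                char buf[NAME_LEN];
                int len = example_symbol_lookup(0xdeadbeef, buf);

                if (len > 0)                              /* success: buffer is filled */
                        printf("%.*s (len=%d)\n", len, buf, len);
                return 0;
        }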
      
      Link: https://lore.kernel.org/lkml/20200107214042.855757-1-arnd@arndb.de/
      Link: https://lore.kernel.org/lkml/20240326130647.7bfb1d92@gandalf.local.home/
      
      
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  9. Jun 26, 2024
    • bpf: add missing check_func_arg_reg_off() to prevent out-of-bounds memory accesses · ec2b9a5e
      Matt Bobrowski authored
      
      Currently, it's possible to pass in a modified CONST_PTR_TO_DYNPTR to
      a global function as an argument. The adverse effect of this is that
      BPF helpers can continue to make use of this modified
      CONST_PTR_TO_DYNPTR from within the context of the global function,
      which can unintentionally result in out-of-bounds memory accesses and
      therefore compromise overall system stability, e.g.:
      
      [  244.157771] BUG: KASAN: slab-out-of-bounds in bpf_dynptr_data+0x137/0x140
      [  244.161345] Read of size 8 at addr ffff88810914be68 by task test_progs/302
      [  244.167151] CPU: 0 PID: 302 Comm: test_progs Tainted: G O E 6.10.0-rc3-00131-g66b586715063 #533
      [  244.174318] Call Trace:
      [  244.175787]  <TASK>
      [  244.177356]  dump_stack_lvl+0x66/0xa0
      [  244.179531]  print_report+0xce/0x670
      [  244.182314]  ? __virt_addr_valid+0x200/0x3e0
      [  244.184908]  kasan_report+0xd7/0x110
      [  244.187408]  ? bpf_dynptr_data+0x137/0x140
      [  244.189714]  ? bpf_dynptr_data+0x137/0x140
      [  244.192020]  bpf_dynptr_data+0x137/0x140
      [  244.194264]  bpf_prog_b02a02fdd2bdc5fa_global_call_bpf_dynptr_data+0x22/0x26
      [  244.198044]  bpf_prog_b0fe7b9d7dc3abde_callback_adjust_bpf_dynptr_reg_off+0x1f/0x23
      [  244.202136]  bpf_user_ringbuf_drain+0x2c7/0x570
      [  244.204744]  ? 0xffffffffc0009e58
      [  244.206593]  ? __pfx_bpf_user_ringbuf_drain+0x10/0x10
      [  244.209795]  bpf_prog_33ab33f6a804ba2d_user_ringbuf_callback_const_ptr_to_dynptr_reg_off+0x47/0x4b
      [  244.215922]  bpf_trampoline_6442502480+0x43/0xe3
      [  244.218691]  __x64_sys_prlimit64+0x9/0xf0
      [  244.220912]  do_syscall_64+0xc1/0x1d0
      [  244.223043]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
      [  244.226458] RIP: 0033:0x7ffa3eb8f059
      [  244.228582] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 1d 0d 00 f7 d8 64 89 01 48
      [  244.241307] RSP: 002b:00007ffa3e9c6eb8 EFLAGS: 00000206 ORIG_RAX: 000000000000012e
      [  244.246474] RAX: ffffffffffffffda RBX: 00007ffa3e9c7cdc RCX: 00007ffa3eb8f059
      [  244.250478] RDX: 00007ffa3eb162b4 RSI: 0000000000000000 RDI: 00007ffa3e9c7fb0
      [  244.255396] RBP: 00007ffa3e9c6ed0 R08: 00007ffa3e9c76c0 R09: 0000000000000000
      [  244.260195] R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffff80
      [  244.264201] R13: 000000000000001c R14: 00007ffc5d6b4260 R15: 00007ffa3e1c7000
      [  244.268303]  </TASK>
      
      Add a check_func_arg_reg_off() call to the path in which the BPF
      verifier verifies the arguments of global functions, specifically
      those which take an argument of type ARG_PTR_TO_DYNPTR |
      MEM_RDONLY. Also, process_dynptr_func() doesn't appear to perform any
      explicit and strict type matching on the supplied register type, so
      let's also enforce that a register of type PTR_TO_STACK or
      CONST_PTR_TO_DYNPTR is supplied by the caller.
      
      Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
      Link: https://lore.kernel.org/r/20240625062857.92760-1-mattbobrowski@google.com
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. Jun 25, 2024
  11. Jun 24, 2024
    • net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT. · 3f9fe37d
      Sebastian Andrzej Siewior authored
      
      The per-CPU flush lists, which are accessed from within the NAPI
      callback (xdp_do_flush() for instance), are subject to the same
      problem as struct bpf_redirect_info.

      Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
      xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
      access. The lists are initialized on first usage (similar to
      bpf_net_ctx_get_ri()).
      
      Cc: "Björn Töpel" <bjorn@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.de
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. · 401cb7da
      Sebastian Andrzej Siewior authored
      
      The XDP redirect process is two-staged:
      - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
        packet and makes decisions. While doing that, the per-CPU variable
        bpf_redirect_info is used.
      
      - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
        and it may also access other per-CPU variables like xskmap_flush_list.
      
      At the very end of the NAPI callback, xdp_do_flush() is invoked which
      does not access bpf_redirect_info but will touch the individual per-CPU
      lists.
      
      The per-CPU variables are only used in the NAPI callback, hence
      disabling bottom halves is the only protection mechanism. Users from
      preemptible context (like cpu_map_kthread_run()) explicitly disable
      bottom halves for protection reasons.
      Without the locking in local_bh_disable() on PREEMPT_RT, this data
      structure requires explicit locking.
      
      PREEMPT_RT has forced-threaded interrupts enabled and every
      NAPI-callback runs in a thread. If each thread has its own data
      structure then locking can be avoided.
      
      Create a struct bpf_net_context which contains struct bpf_redirect_info.
      Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
      it, bpf_net_ctx_clear() removes it again.
      The bpf_net_ctx_set() may nest. For instance a function can be used
      from within NET_RX_SOFTIRQ/ net_rx_action, which uses
      bpf_net_ctx_set(), and from NET_TX_SOFTIRQ, which does not. Therefore
      only the first invocation updates the pointer.
      Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
      bpf_redirect_info. The returned data structure is zero initialized to
      ensure nothing is leaked from stack. This is done on first usage of the
      struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
      note that initialisation is required. First invocation of
      bpf_net_ctx_get_ri() will memset() the data structure and update
      bpf_redirect_info::kern_flags.
      bpf_redirect_info::nh is excluded from the memset because it is only
      used when BPF_F_NEIGH is set, which also sets the nh member. The
      kern_flags field is moved past nh to exclude it from the memset.
      
      The pointer to bpf_net_context is saved in the task's task_struct.
      Always using the bpf_net_context approach has the advantage that there
      are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
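
      As a self-contained illustration of the mechanism described above,
      here is a small user-space model of the set/clear/get pattern (the
      structures and function bodies are invented for the example; the
      in-kernel helpers differ in detail):

        #include <stdio.h>
        #include <string.h>
        #include <stdbool.h>

        struct redirect_info { int map_id; int flags; bool initialized; };
        struct net_context   { struct redirect_info ri; };

        /* Stand-in for the pointer stored in task_struct. */
        static __thread struct net_context *task_ctx;

        static struct net_context *ctx_set(struct net_context *ctx)
        {
                if (task_ctx)                   /* nested call: keep the outer context */
                        return NULL;
                ctx->ri.initialized = false;    /* defer zeroing to first use */
                task_ctx = ctx;
                return ctx;
        }

        static void ctx_clear(struct net_context *ctx)
        {
                if (ctx)                        /* only the outermost caller clears */
                        task_ctx = NULL;
        }

        static struct redirect_info *ctx_get_ri(void)
        {
                struct redirect_info *ri = &task_ctx->ri;

                if (!ri->initialized) {         /* zero on first use, not on set() */
                        memset(ri, 0, sizeof(*ri));
                        ri->initialized = true;
                }
                return ri;
        }

        int main(void)
        {
                struct net_context ctx;
                struct net_context *owner = ctx_set(&ctx);   /* on-stack context */

                ctx_get_ri()->map_id = 42;
                printf("map_id=%d\n", ctx_get_ri()->map_id);
                ctx_clear(owner);
                return 0;
        }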
      
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • locking/local_lock: Add local nested BH locking infrastructure. · c5bcab75
      Sebastian Andrzej Siewior authored
      
      Add local_lock_nested_bh() locking. It is based on local_lock_t and the
      naming follows the preempt_disable_nested() example.
      
      For !PREEMPT_RT + !LOCKDEP it is a per-CPU annotation for locking
      assumptions based on local_bh_disable(). The macro is optimized away
      during compilation.
      For !PREEMPT_RT + LOCKDEP the local_lock_nested_bh() is reduced to
      the usual lock-acquire plus lockdep_assert_in_softirq() - ensuring that
      BH is disabled.
      
      For PREEMPT_RT local_lock_nested_bh() acquires the specified per-CPU
      lock. It does not disable CPU migration because it relies on
      local_bh_disable() disabling CPU migration.
      With LOCKDEP it performs the usual lockdep checks as with !PREEMPT_RT.
      Due to include hell the softirq check has been moved to spinlock.c.
      
      The intention is to use this locking in places where locking of a per-CPU
      variable relies on BH being disabled. Instead of treating disabled
      bottom halves as a big per-CPU lock, PREEMPT_RT can use this to reduce
      the locking scope to what actually needs protecting.
      A side effect is that it also documents the protection scope of the
      per-CPU variables.
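
      A usage sketch of the intended pattern follows; the per-CPU structure
      and its fields are illustrative only and are not taken from this
      patch:

        struct foo_pcpu {
                int             counter;
                local_lock_t    bh_lock;        /* documents BH-based protection */
        };
        static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
                .bh_lock = INIT_LOCAL_LOCK(bh_lock),
        };

        static void foo_update(void)
        {
                /* Caller already runs with bottom halves disabled. */
                local_lock_nested_bh(&foo_pcpu.bh_lock);
                this_cpu_inc(foo_pcpu.counter);
                local_unlock_nested_bh(&foo_pcpu.bh_lock);
        }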
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://patch.msgid.link/20240620132727.660738-3-bigeasy@linutronix.de
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bpf: Fix may_goto with negative offset. · 2b2efe19
      Alexei Starovoitov authored
      Zac's syzbot crafted a bpf prog that exposed two bugs in may_goto.
      The 1st bug is the way may_goto is patched. When offset is negative
      it should be patched differently.
      The 2nd bug is in the verifier:
      when current state may_goto_depth is equal to visited state may_goto_depth
      it means there is an actual infinite loop. It's not correct to prune
      exploration of the program at this point.
      Note, that this check doesn't limit the program to only one may_goto insn,
      since 2nd and any further may_goto will increment may_goto_depth only
      in the queued state pushed for future exploration. The current state
      will have may_goto_depth == 0 regardless of number of may_goto insns
      and the verifier has to explore the program until bpf_exit.
      
      Fixes: 011832b9 ("bpf: Introduce may_goto instruction")
      Reported-by: Zac Ecob <zacecob@protonmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Closes: https://lore.kernel.org/bpf/CAADnVQL-15aNp04-cyHRn47Yv61NXfYyhopyZtUyxNojUZUXpA@mail.gmail.com/
      Link: https://lore.kernel.org/bpf/20240619235355.85031-1-alexei.starovoitov@gmail.com
  12. Jun 23, 2024
  13. Jun 21, 2024
    • libbpf,bpf: Share BTF relocate-related code with kernel · 8646db23
      Alan Maguire authored
      
      Share relocation implementation with the kernel.  As part of this,
      we also need the type/string iteration functions, so the btf_iter.c
      file is shared as well. Relocation code in kernel and userspace is
      identical save for the implementation of the reparenting of split BTF
      to the
      relocated base BTF and retrieval of the BTF header from "struct btf";
      these small functions need separate user-space and kernel implementations
      for the separate "struct btf"s they operate upon.
      
      One other wrinkle on the kernel side is we have to map .BTF.ids in
      modules as they were generated with the type ids used at BTF encoding
      time. btf_relocate() optionally returns an array mapping from old BTF
      ids to relocated ids, so we use that to fix up these references where
      needed for kfuncs.
      
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20240620091733.1967885-5-alan.maguire@oracle.com
    • module, bpf: Store BTF base pointer in struct module · d4e48e3d
      Alan Maguire authored
      
      ...as this will allow split BTF modules with a base BTF
      representation (rather than the full vmlinux BTF at time of
      BTF encoding) to resolve their references to kernel types in a
      way that is more resilient to small changes in kernel types.
      
      This will allow modules that are not rebuilt every time the kernel
      is to provide more resilient BTF, rather than have it invalidated
      every time BTF ids for core kernel types change.
      
      Fields are ordered to avoid holes in struct module.
      
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240620091733.1967885-3-alan.maguire@oracle.com
    • bpf: Fix overrunning reservations in ringbuf · cfa1a232
      Daniel Borkmann authored
      The BPF ring buffer internally is implemented as a power-of-2 sized
      circular buffer, with two logical and ever-increasing counters:
      consumer_pos is the consumer counter showing up to which logical
      position the consumer has consumed the data, and producer_pos is the
      producer counter denoting the amount of data reserved by all producers.

      Each time a record is reserved, the producer that "owns" the record
      advances the producer counter. In user space, each time a record is
      read, the consumer of the data advances the consumer counter once it
      has finished processing. Both counters are stored in separate pages so
      that, from user space, the producer counter is read-only and the
      consumer counter is read-write.
      
      One aspect that simplifies and thus speeds up the implementation of
      both producers and consumers is that the data area is mapped twice,
      contiguously back-to-back, in virtual memory. This means no special
      measures are needed for samples that wrap around at the end of the
      circular buffer data area, because the next page after the last data
      page is the first data page again, and thus the sample still appears
      completely contiguous in virtual memory.
      
      Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
      book-keeping the length and offset, and is inaccessible to the BPF program.
      Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
      for the BPF program to use. Bing-Jhong and Muhammad reported that it
      is, however, possible to make a second allocated memory chunk overlap
      with the first chunk and, as a result, the BPF program is able to edit
      the first chunk's header.
      
      For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
      of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
      bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
      [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets
      allocate a chunk B with size 0x3000. This will succeed because consumer_pos
      was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
      check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
      to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
      earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
      pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
      bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
      locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
      B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
      page and could cause a crash.
      
      Fix it by calculating the oldest pending_pos and checking whether the
      range from the oldest outstanding record to the newest would span
      beyond the ring buffer size. If that is the case, then reject the
      request. We've tested with the ring buffer benchmark in BPF selftests
      (./benchs/run_bench_ringbufs.sh) before/after the fix and while it
      seems a bit slower on some benchmarks, it is not significant enough to
      matter.
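
      Schematically, the strengthened bound looks roughly like the function
      below (a simplified model, not the kernel code; in particular, finding
      pending_pos requires walking the outstanding records):

        #include <stdbool.h>
        #include <stdint.h>

        /* Simplified model of the reservation check: reject a reservation if
         * the span from the oldest still-uncommitted record (pending_pos) to
         * the new producer position would exceed the ring buffer size, in
         * addition to the pre-existing consumer_pos bound.
         */
        bool reserve_ok(uint64_t cons_pos, uint64_t pending_pos,
                        uint64_t prod_pos, uint64_t rec_len,
                        uint64_t ringbuf_size)
        {
                uint64_t new_prod_pos = prod_pos + rec_len;

                if (new_prod_pos - cons_pos > ringbuf_size)     /* existing bound */
                        return false;
                if (new_prod_pos - pending_pos > ringbuf_size)  /* new bound */
                        return false;
                return true;
        }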
      
      Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
      Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
      Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
      Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
    • bpf: Fix the corner case with may_goto and jump to the 1st insn. · 5337ac4c
      Alexei Starovoitov authored
      When the following program is processed by the verifier:
      L1: may_goto L2
          goto L1
      L2: w0 = 0
          exit
      
      the may_goto insn is first converted to:
      L1: r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
      then later as the last step the verifier inserts:
        *(u64 *)(r10 -8) = BPF_MAX_LOOPS
      as the first insn of the program to initialize loop count.
      
      When the first insn happens to be a branch target of some jmp the
      bpf_patch_insn_data() logic will produce:
      L1: *(u64 *)(r10 -8) = BPF_MAX_LOOPS
          r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
      because instruction patching adjusts all jmps and calls, but for this
      particular corner case it's incorrect and the L1 label should be one
      instruction down, like:
          *(u64 *)(r10 -8) = BPF_MAX_LOOPS
      L1: r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
      and that's what this patch is fixing.
      After bpf_patch_insn_data(), call adjust_jmp_off() to adjust all jmps
      that point to the newly inserted BPF_ST insn so they point to the insn
      after it.
      
      Note that bpf_patch_insn_data() cannot easily be changed to accommodate
      this logic, since jumps that point before or after a sequence of patched
      instructions have to be adjusted with the full length of the patch.
      
      Conceptually it's somewhat similar to an "insert" of instructions
      between other instructions, with weird semantics. An "insert" before
      the 1st insn would require adjustment of CALL insns to point to the
      newly inserted 1st insn, but not an adjustment of JMP insns that point
      to the 1st insn, while still adjusting JMP insns that cross over the
      1st insn (i.e. point to an insn before or after it); hence use the
      simple adjust_jmp_off() logic to fix this corner case. Ideally
      bpf_patch_insn_data() would have auxiliary info to say where 'the
      start of the newly inserted patch is', but that would be too complex
      for a backport.
      
      Fixes: 011832b9 ("bpf: Introduce may_goto instruction")
      Reported-by: Zac Ecob <zacecob@protonmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Closes: https://lore.kernel.org/bpf/CAADnVQJ_WWx8w4b=6Gc2EpzAjgv+6A0ridnMz2TvS2egj4r3Gw@mail.gmail.com/
      Link: https://lore.kernel.org/bpf/20240619011859.79334-1-alexei.starovoitov@gmail.com
    • bpf: Add security_file_post_open() LSM hook to sleepable_lsm_hooks · 6ddf3a9a
      Matt Bobrowski authored
      The new generic LSM hook security_file_post_open() was recently added
      to the LSM framework in commit 8f46ff57 ("security: Introduce
      file_post_open hook"). Let's proactively add this generic LSM hook to
      the sleepable_lsm_hooks BTF ID set, because I can't see there being
      any strong reasons not to, and it's only a matter of time before
      someone else comes around and asks for it to be there.
      
      security_file_post_open() is inherently sleepable as it's purposely
      situated at a point in the kernel where LSMs are allowed to directly
      read out the contents of the backing file if need be. Additionally,
      it's called
      directly after security_file_open(), and that LSM hook in itself
      already exists in the sleepable_lsm_hooks BTF ID set.
      
      Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20240618192923.379852-1-mattbobrowski@google.com
    • bpf: Change bpf_session_cookie return value to __u64 * · 717d6313
      Jiri Olsa authored
      This reverts [1] and changes the return value for bpf_session_cookie
      in bpf selftests. Having long * might lead to problems on 32-bit
      architectures.
      
      Fixes: 2b8dd873 ("bpf: Make bpf_session_cookie() kfunc return long *")
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240619081624.1620152-1-jolsa@kernel.org
  14. Jun 20, 2024
  15. Jun 18, 2024
  16. Jun 17, 2024
    • bpf: Add missed var_off setting in coerce_subreg_to_size_sx() · 44b7f715
      Yonghong Song authored
      In coerce_subreg_to_size_sx(), for the case where the upper
      sign extension bits are the same for the smax32 and smin32
      values, we missed setting var_off properly. This is especially
      problematic if both smax32's and smin32's sign extension
      bits are 1.
      
      The following is a simple example illustrating the inconsistent
      verifier states due to missed var_off:
      
        0: (85) call bpf_get_prandom_u32#7    ; R0_w=scalar()
        1: (bf) r3 = r0                       ; R0_w=scalar(id=1) R3_w=scalar(id=1)
        2: (57) r3 &= 15                      ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))
        3: (47) r3 |= 128                     ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf))
        4: (bc) w7 = (s8)w3
        REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f]
          u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf)
      
      The var_off=(0x80, 0xf) is not correct, and the correct one should
      be var_off=(0xffffff80; 0xf) since from insn 3, we know that at
      insn 4, the sign extension bits will be 1. This patch fixes this
      issue by setting var_off properly.
      
      Fixes: 8100928c ("bpf: Support new sign-extension mov insns")
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add missed var_off setting in set_sext32_default_val() · 380d5f89
      Yonghong Song authored
      Zac reported a verification failure and Alexei reproduced the issue
      with a simple reproducer ([1]). The verification failure is due to a
      missing var_off setting.
      
      The following is the reproducer in [1]:
        0: R1=ctx() R10=fp0
        0: (71) r3 = *(u8 *)(r10 -387)        ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0
        1: (bc) w7 = (s8)w3                   ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
           R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
        2: (36) if w7 >= 0x2533823b goto pc-3
           mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1
           mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3
           mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387)
        2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
        3: (b4) w0 = 0                        ; R0_w=0
        4: (95) exit
      
      Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not
      correct since the upper 24 bits of w7 could be all 0s or all 1s. The
      correct var_off should be (0x0; 0xffffffff). The missing var_off
      setting in set_sext32_default_val() caused later incorrect analysis in
      zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg).
      
      To fix the issue, set var_off correctly in set_sext32_default_val(). The correct
      reg state after insn 1 becomes:
        1: (bc) w7 = (s8)w3                   ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
           R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff))
      and at insn 2, the verifier correctly determines either branch is possible.
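
      For a quick user-space illustration of why the upper 24 bits cannot be
      assumed zero after a sign-extending 8-bit move (this snippet is just an
      analogy, not part of the fix):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint8_t a = 0x7f, b = 0x80;     /* bit 7 clear vs. bit 7 set */

                /* (s8) cast: the upper 24 bits become all 0s or all 1s. */
                printf("0x%08x\n", (uint32_t)(int32_t)(int8_t)a);  /* 0x0000007f */
                printf("0x%08x\n", (uint32_t)(int32_t)(int8_t)b);  /* 0xffffff80 */
                return 0;
        }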
      
        [1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/
      
      Fixes: 8100928c ("bpf: Support new sign-extension mov insns")
      Reported-by: Zac Ecob <zacecob@protonmail.com>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked() · 932d8476
      Yuntao Wang authored
      Commit 4205e478 ("cpu/hotplug: Provide dynamic range for prepare
      stage") added a dynamic range for the prepare states, but did not handle
      the assignment of the dynstate variable in __cpuhp_setup_state_cpuslocked().
      
      This causes the corresponding startup callback not to be invoked when
      calling __cpuhp_setup_state_cpuslocked() with the CPUHP_BP_PREPARE_DYN
      parameter, even though it should be.
      
      Currently, the users of __cpuhp_setup_state_cpuslocked(), for one reason or
      another, have not triggered this bug.
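
      For reference, registration of a dynamically allocated prepare-stage
      state typically looks roughly like the sketch below (the callbacks and
      name are illustrative, not existing users); such registrations rely on
      the startup callback being invoked for already-present CPUs, which is
      what the missing dynstate assignment interfered with:

        static int example_prepare_cpu(unsigned int cpu)
        {
                pr_info("example: prepare cpu %u\n", cpu);
                return 0;
        }

        static int example_dead_cpu(unsigned int cpu)
        {
                pr_info("example: cpu %u dead\n", cpu);
                return 0;
        }

        static int __init example_init(void)
        {
                int state;

                /* Allocates a state in the dynamic prepare range and is
                 * expected to invoke example_prepare_cpu() on present CPUs.
                 */
                state = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "example:prepare",
                                          example_prepare_cpu, example_dead_cpu);
                return state < 0 ? state : 0;
        }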
      
      Fixes: 4205e478 ("cpu/hotplug: Provide dynamic range for prepare stage")
      Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240515134554.427071-1-ytcoode@gmail.com