  1. Jun 27, 2024
      kallsyms: rework symbol lookup return codes · 7e1f4eb9
      Arnd Bergmann authored
      Building with W=1 in some configurations produces a false positive
      warning for kallsyms:
      
      kernel/kallsyms.c: In function '__sprint_symbol.isra':
      kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict]
        503 |                 strcpy(buffer, name);
            |                 ^~~~~~~~~~~~~~~~~~~~
      
      This originally showed up while building with -O3, but later started
      happening in other configurations as well, depending on inlining
decisions. The underlying issue is that the local 'name' variable is
always initialized to be the same as 'buffer' in the called functions
that fill the buffer, which gcc notices while inlining, even though the
address check means the copy is always skipped in that case.
      
      The calling conventions here are rather unusual, as all of the internal
      lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
      ftrace_func_address_lookup, module_address_lookup and
      kallsyms_lookup_buildid) already use the provided buffer and either return
      the address of that buffer to indicate success, or NULL for failure,
      but the callers are written to also expect an arbitrary other buffer
      to be returned.
      
      Rework the calling conventions to return the length of the filled buffer
      instead of its address, which is simpler and easier to follow as well
      as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
      unchanged, since that is called from 16 different functions and
      adapting this would be a much bigger change.
      
      Link: https://lore.kernel.org/lkml/20200107214042.855757-1-arnd@arndb.de/
      Link: https://lore.kernel.org/lkml/20240326130647.7bfb1d92@gandalf.local.home/
      
      
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  5. Jun 21, 2024
      bpf: Fix overrunning reservations in ringbuf · cfa1a232
      Daniel Borkmann authored
The BPF ring buffer is internally implemented as a power-of-2 sized circular
buffer, with two logical, ever-increasing counters: consumer_pos is the
consumer counter, showing the logical position up to which the consumer has
consumed data, and producer_pos is the producer counter, denoting the amount
of data reserved by all producers.
      
Each time a record is reserved, the producer that "owns" the record advances
the producer counter. In user space, each time a record is read, the consumer
advances the consumer counter once it has finished processing. Both counters
are stored in separate pages so that, from user space, the producer counter
is read-only and the consumer counter is read-write.
      
One aspect that simplifies, and thus speeds up, the implementation of both
producers and consumers is that the data area is mapped twice, contiguously
back-to-back, in virtual memory. No special measures are needed for samples
that wrap around at the end of the circular buffer data area, because the
next page after the last data page is the first data page again, so the
sample still appears completely contiguous in virtual memory.
      
      Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
      book-keeping the length and offset, and is inaccessible to the BPF program.
      Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is,
however, possible to make a second allocated memory chunk overlap the first
chunk and, as a result, the BPF program is then able to edit the first
chunk's header.
      
      For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
      of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
      bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, let's
      allocate a chunk B with size 0x3000. This will succeed because consumer_pos
      was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
      check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
      to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
      earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
      pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
      bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
      locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
      B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
      page and could cause a crash.
      
Fix it by calculating the oldest pending_pos and checking whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before and after the fix; while it seems a bit slower on some benchmarks,
it is not significant enough to matter.
      
Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
      bpf: Fix the corner case with may_goto and jump to the 1st insn. · 5337ac4c
      Alexei Starovoitov authored
      When the following program is processed by the verifier:
      L1: may_goto L2
          goto L1
      L2: w0 = 0
          exit
      
      the may_goto insn is first converted to:
      L1: r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
      then later as the last step the verifier inserts:
        *(u64 *)(r10 -8) = BPF_MAX_LOOPS
      as the first insn of the program to initialize loop count.
      
When the first insn happens to be a branch target of some jmp, the
bpf_patch_insn_data() logic will produce:
      L1: *(u64 *)(r10 -8) = BPF_MAX_LOOPS
          r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
      because instruction patching adjusts all jmps and calls, but for this
      particular corner case it's incorrect and the L1 label should be one
      instruction down, like:
          *(u64 *)(r10 -8) = BPF_MAX_LOOPS
      L1: r11 = *(u64 *)(r10 -8)
          if r11 == 0x0 goto L2
          r11 -= 1
          *(u64 *)(r10 -8) = r11
          goto L1
      L2: w0 = 0
          exit
      
and that's what this patch is fixing: after bpf_patch_insn_data(), call
adjust_jmp_off() to adjust all jmps that point to the newly inserted
BPF_ST insn so that they point to the insn after it.
      
      Note that bpf_patch_insn_data() cannot easily be changed to accommodate
      this logic, since jumps that point before or after a sequence of patched
      instructions have to be adjusted with the full length of the patch.
      
Conceptually it's somewhat similar to an "insert" of instructions between
other instructions, with weird semantics: an "insert" before the 1st insn
would require adjusting CALL insns to point to the newly inserted 1st insn,
but not adjusting JMP insns that point to the 1st insn, while still adjusting
JMP insns that cross over the 1st insn (point to an insn before or after it).
Hence, use the simple adjust_jmp_off() logic to fix this corner case. Ideally
bpf_patch_insn_data() would carry auxiliary info saying where the start of
the newly inserted patch is, but that would be too complex for a backport.
      
Fixes: 011832b9 ("bpf: Introduce may_goto instruction")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Closes: https://lore.kernel.org/bpf/CAADnVQJ_WWx8w4b=6Gc2EpzAjgv+6A0ridnMz2TvS2egj4r3Gw@mail.gmail.com/
      Link: https://lore.kernel.org/bpf/20240619011859.79334-1-alexei.starovoitov@gmail.com
  7. Jun 17, 2024
      bpf: Add missed var_off setting in coerce_subreg_to_size_sx() · 44b7f715
      Yonghong Song authored
In coerce_subreg_to_size_sx(), for the case where the upper
sign-extension bits are the same for the smax32 and smin32
values, we missed setting up var_off properly. This is especially
problematic if both smax32's and smin32's sign-extension
bits are 1.
      
      The following is a simple example illustrating the inconsistent
      verifier states due to missed var_off:
      
        0: (85) call bpf_get_prandom_u32#7    ; R0_w=scalar()
        1: (bf) r3 = r0                       ; R0_w=scalar(id=1) R3_w=scalar(id=1)
        2: (57) r3 &= 15                      ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))
        3: (47) r3 |= 128                     ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf))
        4: (bc) w7 = (s8)w3
        REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f]
          u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf)
      
The var_off=(0x80, 0xf) is not correct; the correct one should
be var_off=(0xffffff80; 0xf), since from insn 3 we know that at
insn 4 the sign-extension bits will be 1. This patch fixes the
issue by setting var_off properly.
      
Fixes: 8100928c ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev
      
      
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: Add missed var_off setting in set_sext32_default_val() · 380d5f89
      Yonghong Song authored
Zac reported a verification failure and Alexei reproduced the issue
with a simple reproducer ([1]). The verification failure is due to a
missed var_off setting.
      
      The following is the reproducer in [1]:
        0: R1=ctx() R10=fp0
        0: (71) r3 = *(u8 *)(r10 -387)        ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0
        1: (bc) w7 = (s8)w3                   ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
           R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
        2: (36) if w7 >= 0x2533823b goto pc-3
           mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1
           mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3
           mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387)
        2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
        3: (b4) w0 = 0                        ; R0_w=0
        4: (95) exit
      
Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not correct,
since the upper 24 bits of w7 could be all 0s or all 1s. The correct var_off
should be (0x0; 0xffffffff). The missing var_off setting in
set_sext32_default_val() caused later incorrect analysis in
zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg).
      
      To fix the issue, set var_off correctly in set_sext32_default_val(). The correct
      reg state after insn 1 becomes:
        1: (bc) w7 = (s8)w3                   ;
           R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
           R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff))
      and at insn 2, the verifier correctly determines either branch is possible.
      
        [1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/
      
Fixes: 8100928c ("bpf: Support new sign-extension mov insns")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev
      
      
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked() · 932d8476
      Yuntao Wang authored
      Commit 4205e478 ("cpu/hotplug: Provide dynamic range for prepare
      stage") added a dynamic range for the prepare states, but did not handle
      the assignment of the dynstate variable in __cpuhp_setup_state_cpuslocked().
      
      This causes the corresponding startup callback not to be invoked when
      calling __cpuhp_setup_state_cpuslocked() with the CPUHP_BP_PREPARE_DYN
      parameter, even though it should be.
      
      Currently, the users of __cpuhp_setup_state_cpuslocked(), for one reason or
      another, have not triggered this bug.
      
Fixes: 4205e478 ("cpu/hotplug: Provide dynamic range for prepare stage")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240515134554.427071-1-ytcoode@gmail.com
  9. Jun 13, 2024
      ima: Avoid blocking in RCU read-side critical section · 9a95c5bf
      GUO Zihua authored
      A panic happens in ima_match_policy:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      PGD 42f873067 P4D 0
      Oops: 0000 [#1] SMP NOPTI
      CPU: 5 PID: 1286325 Comm: kubeletmonit.sh
      Kdump: loaded Tainted: P
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
                     BIOS 0.0.0 02/06/2015
      RIP: 0010:ima_match_policy+0x84/0x450
      Code: 49 89 fc 41 89 cf 31 ed 89 44 24 14 eb 1c 44 39
            7b 18 74 26 41 83 ff 05 74 20 48 8b 1b 48 3b 1d
            f2 b9 f4 00 0f 84 9c 01 00 00 <44> 85 73 10 74 ea
            44 8b 6b 14 41 f6 c5 01 75 d4 41 f6 c5 02 74 0f
      RSP: 0018:ff71570009e07a80 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000200
      RDX: ffffffffad8dc7c0 RSI: 0000000024924925 RDI: ff3e27850dea2000
      RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffabfce739
      R10: ff3e27810cc42400 R11: 0000000000000000 R12: ff3e2781825ef970
      R13: 00000000ff3e2785 R14: 000000000000000c R15: 0000000000000001
      FS:  00007f5195b51740(0000)
      GS:ff3e278b12d40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000000626d24002 CR4: 0000000000361ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ima_get_action+0x22/0x30
       process_measurement+0xb0/0x830
       ? page_add_file_rmap+0x15/0x170
       ? alloc_set_pte+0x269/0x4c0
       ? prep_new_page+0x81/0x140
       ? simple_xattr_get+0x75/0xa0
       ? selinux_file_open+0x9d/0xf0
       ima_file_check+0x64/0x90
       path_openat+0x571/0x1720
       do_filp_open+0x9b/0x110
       ? page_counter_try_charge+0x57/0xc0
       ? files_cgroup_alloc_fd+0x38/0x60
       ? __alloc_fd+0xd4/0x250
       ? do_sys_open+0x1bd/0x250
       do_sys_open+0x1bd/0x250
       do_syscall_64+0x5d/0x1d0
       entry_SYSCALL_64_after_hwframe+0x65/0xca
      
Commit c7423dbd ("ima: Handle -ESTALE returned by
ima_filter_rule_match()") introduced a call to ima_lsm_copy_rule() within
an RCU read-side critical section, which contains a kmalloc() with
GFP_KERNEL. This implies a possible sleep and violates the limitations of
RCU read-side critical sections on non-PREEMPT systems.
      
Sleeping within an RCU read-side critical section might cause
synchronize_rcu() to return early and break RCU protection, allowing a
UAF to happen.
      
      The root cause of this issue could be described as follows:
      |	Thread A	|	Thread B	|
      |			|ima_match_policy	|
      |			|  rcu_read_lock	|
      |ima_lsm_update_rule	|			|
      |  synchronize_rcu	|			|
      |			|    kmalloc(GFP_KERNEL)|
      |			|      sleep		|
      ==> synchronize_rcu returns early
      |  kfree(entry)		|			|
      |			|    entry = entry->next|
      ==> UAF happens and entry now becomes NULL (or could be anything).
      |			|    entry->action	|
      ==> Accessing entry might cause panic.
      
To fix this issue, convert all kmalloc() calls within the RCU read-side
critical section to use GFP_ATOMIC.
      
Fixes: c7423dbd ("ima: Handle -ESTALE returned by ima_filter_rule_match()")
      Cc: stable@vger.kernel.org
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
      [PM: fixed missing comment, long lines, !CONFIG_IMA_LSM_RULES case]
Signed-off-by: Paul Moore <paul@paul-moore.com>
      bpf: fix UML x86_64 compile failure · b99a95bc
      Maciej Żenczykowski authored
pcpu_hot (defined in arch/x86) is not available on user mode linux (ARCH=um).
      
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
Fixes: 1ae69210 ("bpf: inline bpf_get_smp_processor_id() helper")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Link: https://lore.kernel.org/r/20240613173146.2524647-1-maze@google.com
      
      
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: Reduce stack consumption in check_stack_write_fixed_off · e73cd1cf
      Daniel Borkmann authored
      
fake_reg was moved into env->fake_reg given that it consumes a lot of stack
space (120 bytes). Now that we have it, migrate check_stack_write_fixed_off()
to use it as well.
      
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20240613115310.25383-2-daniel@iogearbox.net
      
      
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: Fix reg_set_min_max corruption of fake_reg · 92424801
      Daniel Borkmann authored
      Juan reported that after doing some changes to buzzer [0] and implementing
      a new fuzzing strategy guided by coverage, they noticed the following in
      one of the probes:
      
        [...]
        13: (79) r6 = *(u64 *)(r0 +0)         ; R0=map_value(ks=4,vs=8) R6_w=scalar()
        14: (b7) r0 = 0                       ; R0_w=0
        15: (b4) w0 = -1                      ; R0_w=0xffffffff
        16: (74) w0 >>= 1                     ; R0_w=0x7fffffff
        17: (5c) w6 &= w0                     ; R0_w=0x7fffffff R6_w=scalar(smin=smin32=0,smax=umax=umax32=0x7fffffff,var_off=(0x0; 0x7fffffff))
        18: (44) w6 |= 2                      ; R6_w=scalar(smin=umin=smin32=umin32=2,smax=umax=umax32=0x7fffffff,var_off=(0x2; 0x7ffffffd))
        19: (56) if w6 != 0x7ffffffd goto pc+1
        REG INVARIANTS VIOLATION (true_reg2): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
        REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
        REG INVARIANTS VIOLATION (false_reg2): const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] u32=[0x0, 0xffffffff] s32=[0x80000000, 0x7fffffff] var_off=(0x7fffffff, 0x0)
        19: R6_w=0x7fffffff
        20: (95) exit
      
        from 19 to 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
        21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
        21: (14) w6 -= 2147483632             ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=14,var_off=(0x2; 0xfffffffd))
        22: (76) if w6 s>= 0xe goto pc+1      ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=13,var_off=(0x2; 0xfffffffd))
        23: (95) exit
      
        from 22 to 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
        24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
        24: (14) w6 -= 14                     ; R6_w=0
        [...]
      
      What can be seen here is a register invariant violation on line 19. After
      the binary-or in line 18, the verifier knows that bit 2 is set but knows
      nothing about the rest of the content which was loaded from a map value,
      meaning, range is [2,0x7fffffff] with var_off=(0x2; 0x7ffffffd). When in
      line 19 the verifier analyzes the branch, it splits the register states
      in reg_set_min_max() into the registers of the true branch (true_reg1,
      true_reg2) and the registers of the false branch (false_reg1, false_reg2).
      
      Since the test is w6 != 0x7ffffffd, the src_reg is a known constant.
      Internally, the verifier creates a "fake" register initialized as scalar
      to the value of 0x7ffffffd, and then passes it onto reg_set_min_max(). Now,
for line 19, it is mathematically impossible to take the false branch of
this program, yet the verifier analyzes it. It is impossible because the
second bit of r6 will be set due to the prior or operation, and the
constant in the condition has that bit unset (hex(fd) == binary(1111 1101)).
      
      When the verifier first analyzes the false / fall-through branch, it will
      compute an intersection between the var_off of r6 and of the constant. This
      is because the verifier creates a "fake" register initialized to the value
      of the constant. The intersection result later refines both registers in
      regs_refine_cond_op():
      
        [...]
        t = tnum_intersect(tnum_subreg(reg1->var_off), tnum_subreg(reg2->var_off));
        reg1->var_off = tnum_with_subreg(reg1->var_off, t);
        reg2->var_off = tnum_with_subreg(reg2->var_off, t);
        [...]
      
      Since the verifier is analyzing the false branch of the conditional jump,
      reg1 is equal to false_reg1 and reg2 is equal to false_reg2, i.e. the reg2
      is the "fake" register that was meant to hold a constant value. The resulting
      var_off of the intersection says that both registers now hold a known value
      of var_off=(0x7fffffff, 0x0) or in other words: this operation manages to
      make the verifier think that the "constant" value that was passed in the
      jump operation now holds a different value.
      
Normally this would not be an issue since it should not influence the true
branch; however, false_reg2 and true_reg2 are pointers to the same "fake"
register, meaning the false branch can influence the results of the true
branch. In line 24, the verifier assumes R6_w=0, but the actual runtime
      value in this case is 1. The fix is simply not passing in the same "fake"
      register location as inputs to reg_set_min_max(), but instead making a
      copy. Moving the fake_reg into the env also reduces stack consumption by
      120 bytes. With this, the verifier successfully rejects invalid accesses
      from the test program.
      
        [0] https://github.com/google/buzzer
      
Fixes: 67420501 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons")
Reported-by: Juan José López Jaimez <jjlopezjaimez@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20240613115310.25383-1-daniel@iogearbox.net
      
      
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. Jun 11, 2024
      tracing: Build event generation tests only as modules · 3572bd56
      Masami Hiramatsu (Google) authored
The kprobes and synth event generation test modules add events and lock
(get a reference on) those event files in their module init functions,
and unlock and delete them in their module exit functions. This is
because they are designed to be used as modules.

If those tests are built in, the events are left locked in the kernel
and can never be removed. This causes the kprobe event self-test failure
below.
      
      [   97.349708] ------------[ cut here ]------------
      [   97.353453] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:2133 kprobe_trace_self_tests_init+0x3f1/0x480
      [   97.357106] Modules linked in:
      [   97.358488] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-g699646734ab5-dirty #14
      [   97.361556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
      [   97.363880] RIP: 0010:kprobe_trace_self_tests_init+0x3f1/0x480
      [   97.365538] Code: a8 24 08 82 e9 ae fd ff ff 90 0f 0b 90 48 c7 c7 e5 aa 0b 82 e9 ee fc ff ff 90 0f 0b 90 48 c7 c7 2d 61 06 82 e9 8e fd ff ff 90 <0f> 0b 90 48 c7 c7 33 0b 0c 82 89 c6 e8 6e 03 1f ff 41 ff c7 e9 90
      [   97.370429] RSP: 0000:ffffc90000013b50 EFLAGS: 00010286
      [   97.371852] RAX: 00000000fffffff0 RBX: ffff888005919c00 RCX: 0000000000000000
      [   97.373829] RDX: ffff888003f40000 RSI: ffffffff8236a598 RDI: ffff888003f40a68
      [   97.375715] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
      [   97.377675] R10: ffffffff811c9ae5 R11: ffffffff8120c4e0 R12: 0000000000000000
      [   97.379591] R13: 0000000000000001 R14: 0000000000000015 R15: 0000000000000000
      [   97.381536] FS:  0000000000000000(0000) GS:ffff88807dcc0000(0000) knlGS:0000000000000000
      [   97.383813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   97.385449] CR2: 0000000000000000 CR3: 0000000002244000 CR4: 00000000000006b0
      [   97.387347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   97.389277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   97.391196] Call Trace:
      [   97.391967]  <TASK>
      [   97.392647]  ? __warn+0xcc/0x180
      [   97.393640]  ? kprobe_trace_self_tests_init+0x3f1/0x480
      [   97.395181]  ? report_bug+0xbd/0x150
      [   97.396234]  ? handle_bug+0x3e/0x60
      [   97.397311]  ? exc_invalid_op+0x1a/0x50
      [   97.398434]  ? asm_exc_invalid_op+0x1a/0x20
      [   97.399652]  ? trace_kprobe_is_busy+0x20/0x20
      [   97.400904]  ? tracing_reset_all_online_cpus+0x15/0x90
      [   97.402304]  ? kprobe_trace_self_tests_init+0x3f1/0x480
      [   97.403773]  ? init_kprobe_trace+0x50/0x50
      [   97.404972]  do_one_initcall+0x112/0x240
      [   97.406113]  do_initcall_level+0x95/0xb0
      [   97.407286]  ? kernel_init+0x1a/0x1a0
      [   97.408401]  do_initcalls+0x3f/0x70
      [   97.409452]  kernel_init_freeable+0x16f/0x1e0
      [   97.410662]  ? rest_init+0x1f0/0x1f0
      [   97.411738]  kernel_init+0x1a/0x1a0
      [   97.412788]  ret_from_fork+0x39/0x50
      [   97.413817]  ? rest_init+0x1f0/0x1f0
      [   97.414844]  ret_from_fork_asm+0x11/0x20
      [   97.416285]  </TASK>
      [   97.417134] irq event stamp: 13437323
      [   97.418376] hardirqs last  enabled at (13437337): [<ffffffff8110bc0c>] console_unlock+0x11c/0x150
      [   97.421285] hardirqs last disabled at (13437370): [<ffffffff8110bbf1>] console_unlock+0x101/0x150
      [   97.423838] softirqs last  enabled at (13437366): [<ffffffff8108e17f>] handle_softirqs+0x23f/0x2a0
      [   97.426450] softirqs last disabled at (13437393): [<ffffffff8108e346>] __irq_exit_rcu+0x66/0xd0
      [   97.428850] ---[ end trace 0000000000000000 ]---
      
Also, since the dynamic_event file cannot be cleaned up, ftracetest
fails too.

To avoid these issues, build these tests only as modules.
      
      Link: https://lore.kernel.org/all/171811263754.85078.5877446624311852525.stgit@devnote2/
      
      Fixes: 9fe41efa ("tracing: Add synth event generation test module")
Fixes: 64836248 ("tracing: Add kprobe event command generation test module")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  12. Jun 05, 2024
      perf/core: Fix missing wakeup when waiting for context reference · 74751ef5
      Haifeng Xu authored
In our production environment, we found many hung tasks that had been
blocked for more than 18 hours. Their call traces look like this:
      
      [346278.191038] __schedule+0x2d8/0x890
      [346278.191046] schedule+0x4e/0xb0
      [346278.191049] perf_event_free_task+0x220/0x270
      [346278.191056] ? init_wait_var_entry+0x50/0x50
      [346278.191060] copy_process+0x663/0x18d0
      [346278.191068] kernel_clone+0x9d/0x3d0
      [346278.191072] __do_sys_clone+0x5d/0x80
      [346278.191076] __x64_sys_clone+0x25/0x30
      [346278.191079] do_syscall_64+0x5c/0xc0
      [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
      [346278.191086] ? do_syscall_64+0x69/0xc0
      [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
      [346278.191092] ? irqentry_exit+0x19/0x30
      [346278.191095] ? exc_page_fault+0x89/0x160
      [346278.191097] ? asm_exc_page_fault+0x8/0x30
      [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The task was waiting for the refcount to become 1, but from the vmcore,
      we found the refcount has already b...
      74751ef5
  13. Jun 03, 2024
  14. May 31, 2024
  15. May 29, 2024
    • Miguel Ojeda's avatar
      kheaders: use `command -v` to test for existence of `cpio` · 6e58e017
      Miguel Ojeda authored
      Commit 13e1df09 ("kheaders: explicitly validate existence of cpio
      command") added an explicit check for `cpio` using `type`.
      
      However, `type` in `dash` (which is used in some popular distributions
      and base images as the shell script runner) prints the missing message
      to standard output, and thus no error is printed:
      
          $ bash -c 'type missing >/dev/null'
          bash: line 1: type: missing: not found
          $ dash -c 'type missing >/dev/null'
          $
      
      For instance, this issue may be seen by loongarch builders, given its
      defconfig enables CONFIG_IKHEADERS since commit 9cc1df42 ("LoongArch:
      Update Loongson-3 default config file").
      
      Therefore, use `command -v` instead to have consistent behavior, and
      take the chance to provide a more explicit error.
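      As a rough sketch of the difference (the snippet below is illustrative
      only, not the actual check in the kheaders script), `command -v` fails
      with a nonzero exit status in bash and dash alike, so the failure is no
      longer hidden by redirecting output:

      ```shell
      #!/bin/sh
      # `command -v` prints the resolved path on success and exits nonzero on
      # failure -- unlike `type` in dash, whose "not found" message goes to
      # standard output and disappears under >/dev/null.
      if command -v cpio >/dev/null 2>&1; then
          echo "cpio found"
      else
          echo "cpio missing; please install it" >&2
      fi

      # A deliberately nonexistent tool demonstrates the failure path:
      command -v surely-not-a-real-tool-12345 >/dev/null 2>&1 || echo "detected missing tool"
      ```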
      
      Fixes: 13e1df09 ("kheaders: explicitly validate existence of cpio command")
      Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      6e58e017
    • Matthias Maennich's avatar
      kheaders: explicitly define file modes for archived headers · 3bd27a84
      Matthias Maennich authored
      
      Build environments might be running with different umask settings
      resulting in indeterministic file modes for the files contained in
      kheaders.tar.xz. The file itself is served with 444, i.e. world
      readable. Archive the files explicitly with 744,a+X to improve
      reproducibility across build environments.
      
      --mode=0444 is not suitable as directories need to be executable. Also,
      444 makes it hard to delete all the readonly files after extraction.
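      A minimal sketch of the mechanism (paths and the exact mode spelling
      are assumptions here, not the flags used by gen_kheaders.sh): GNU tar's
      `--mode` option stores explicit permissions in the archive, so the
      build host's umask no longer influences the result:

      ```shell
      #!/bin/sh
      # Simulate a build environment with a restrictive umask and show that
      # --mode makes the archived permissions deterministic anyway.
      tmp=$(mktemp -d)
      mkdir -p "$tmp/headers/include"
      echo '#define DEMO 1' > "$tmp/headers/include/demo.h"

      umask 077   # would normally leak into the created files' modes
      # u=rw,go=r gives files world-readable modes; a+X adds execute bits
      # only where needed, i.e. on directories.
      tar -C "$tmp" --mode=u=rw,go=r,a+X -cf "$tmp/headers.tar" headers
      tar -tvf "$tmp/headers.tar"
      rm -rf "$tmp"
      ```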
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Matthias Maennich <maennich@google.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      3bd27a84
  16. May 27, 2024
  17. May 25, 2024
    • Andrii Nakryiko's avatar
      bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic · 4a8f635a
      Andrii Nakryiko authored
      
      get_pid_task() internally already calls rcu_read_lock() and
      rcu_read_unlock(), so there is no point to do this one extra time.
      
      This is a drive-by improvement and has no correctness implications.
      
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240521163401.3005045-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4a8f635a
    • Andrii Nakryiko's avatar
      bpf: fix multi-uprobe PID filtering logic · 46ba0e49
      Andrii Nakryiko authored
      
      The current implementation of the PID filtering logic for multi-uprobes
      in uprobe_prog_run() filters down to the exact *thread*, while the
      intent of PID filtering is to filter by *process*. The check in
      uprobe_prog_run() also differs from the analogous one in
      uprobe_multi_link_filter() for some reason. The latter is correct,
      checking task->mm rather than the task itself.
      
      Fix the check in uprobe_prog_run() to perform the same task->mm check.
      
      While doing this, we also update the get_pid_task() call to use a
      PIDTYPE_TGID lookup, given the intent is to get a representative task
      of the entire process. This doesn't change behavior, but is more
      logical: it now holds the task group leader's task, not an arbitrary
      thread of the process.
      
      Last but not least, given that multi-uprobe support is half-broken due
      to this PID filtering logic (depending on whether PID filtering is
      important or not), we need to make it easy for user-space consumers
      (including libbpf) to detect whether the PID filtering logic has
      already been fixed.
      
      We do it here by adding an early check on passed pid parameter. If it's
      negative (and so has no chance of being a valid PID), we return -EINVAL.
      Previous behavior would eventually return -ESRCH ("No process found"),
      given there can't be any process with negative PID. This subtle change
      won't make any practical change in behavior, but will allow applications
      to detect PID filtering fixes easily. Libbpf fixes take advantage of
      this in the next patch.
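      The detection contract can be sketched as a small userspace toy (the
      function name and harness below are hypothetical; only the
      negative-PID rule mirrors the patch):

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>

      /* Toy model of the new early check: a negative pid can never name a
       * real process, so reject it with -EINVAL up front.  A fixed kernel
       * returns -EINVAL here; an unfixed one eventually returns -ESRCH,
       * which is what lets user space probe for the fix. */
      static int uprobe_multi_check_pid(int pid)
      {
              if (pid < 0)
                      return -EINVAL;
              return 0; /* pid looks plausible; the real lookup happens later */
      }

      int main(void)
      {
              assert(uprobe_multi_check_pid(-1) == -EINVAL);
              assert(uprobe_multi_check_pid(1234) == 0);
              printf("ok\n");
              return 0;
      }
      ```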
      
      Cc: stable@vger.kernel.org
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Fixes: b733eead ("bpf: Add pid filter support for uprobe_multi link")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240521163401.3005045-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46ba0e49
  18. May 24, 2024
  19. May 23, 2024
    • Jeff Xu's avatar
      mseal: wire up mseal syscall · ff388fe5
      Jeff Xu authored
      Patch series "Introduce mseal", v10.
      
      This patchset proposes a new mseal() syscall for the Linux kernel.
      
      In a nutshell, mseal() protects the VMAs of a given virtual memory range
      against modifications, such as changes to their permission bits.
      
      Modern CPUs support memory permissions, such as the read/write (RW) and
      no-execute (NX) bits.  Linux has supported NX since the release of kernel
      version 2.6.8 in August 2004 [1].  The memory permission feature improves
      the security stance on memory corruption bugs, as an attacker cannot
      simply write to arbitrary memory and point the code to it.  The memory
      must be marked with the X bit, or else an exception will occur. 
      Internally, the kernel maintains the memory permissions in a data
      structure called VMA (vm_area_struct).  mseal() additionally protects the
      VMA itself against modifications of the selected seal type.
      
      Memory sealing is useful to mitigate memory corruption issues where a
      corrupted pointer is passed to a memory management...
      ff388fe5
    • Andrii Nakryiko's avatar
      uprobes: prevent mutex_lock() under rcu_read_lock() · 69964673
      Andrii Nakryiko authored
      Recent changes made uprobe_cpu_buffer preparation lazy and moved it
      deeper into __uprobe_trace_func(). This is problematic because
      __uprobe_trace_func() is called inside an
      rcu_read_lock()/rcu_read_unlock() block, and it then calls
      prepare_uprobe_buffer() -> uprobe_buffer_get() ->
      mutex_lock(&ucb->mutex), leading to a splat about using a mutex under
      non-sleepable RCU:
      
        BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
         in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq
         preempt_count: 0, expected: 0
         RCU nest depth: 1, expected: 0
         ...
         Call Trace:
          <TASK>
          dump_stack_lvl+0x3d/0xe0
          __might_resched+0x24c/0x270
          ? prepare_uprobe_buffer+0xd5/0x1d0
          __mutex_lock+0x41/0x820
          ? ___perf_sw_event+0x206/0x290
          ? __perf_event_task_sched_in+0x54/0x660
          ? __perf_event_task_sched_in+0x54/0x660
          prepare_uprobe_buffer+0xd5/0x1d0
          __uprobe_trace_func+0x4a/0x140
          uprobe_dispatcher+0x135/0x280
          ? uprobe_dispatcher+0x94/0x280
          uprobe_notify_resume+0x650/0xec0
          ? atomic_notifier_call_chain+0x21/0x110
          ? atomic_notifier_call_chain+0xf8/0x110
          irqentry_exit_to_user_mode+0xe2/0x1e0
          asm_exc_int3+0x35/0x40
         RIP: 0033:0x7f7e1d4da390
         Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 <cc> 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e
         RSP: 002b:00007ffd2abc3608 EFLAGS: 00000246
         RAX: 0000000000000000 RBX: 0000000076d325f1 RCX: 0000000000000000
         RDX: 0000000076d325f1 RSI: 000000000000000a RDI: 00007ffd2abc3690
         RBP: 000000000000000a R08: 00017fb700000000 R09: 00017fb700000000
         R10: 00017fb700000000 R11: 0000000000000246 R12: 0000000000017ff2
         R13: 00007ffd2abc3610 R14: 0000000000000000 R15: 00007ffd2abc3780
          </TASK>
      
      Luckily, this is easy to fix by calling prepare_uprobe_buffer()
      slightly earlier: in uprobe_trace_func() and uretprobe_trace_func(),
      outside of the RCU-locked section. This still keeps the buffer
      preparation lazy and avoids the overhead when it's not needed. E.g.,
      if only a BPF uprobe handler is installed on a given uprobe, the
      buffer won't be initialized.

      Note that the other user of prepare_uprobe_buffer(),
      __uprobe_perf_func(), is not affected, as it doesn't prepare the
      buffer under an RCU read lock.
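      The shape of the fix can be modeled in a userspace sketch (a pthread
      mutex stands in for the kernel mutex, and comments stand in for the
      RCU read-side section; the names mirror the commit but the harness is
      hypothetical):

      ```c
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t ucb_mutex = PTHREAD_MUTEX_INITIALIZER;
      static int buffer_ready;

      /* May sleep (takes a mutex), so it must not run inside an RCU
       * read-side critical section. */
      static void prepare_uprobe_buffer(void)
      {
              pthread_mutex_lock(&ucb_mutex);
              buffer_ready = 1;
              pthread_mutex_unlock(&ucb_mutex);
      }

      /* Runs under rcu_read_lock() in the kernel: only non-sleeping work. */
      static void __uprobe_trace_func(void)
      {
              printf("trace: buffer_ready=%d\n", buffer_ready);
      }

      static void uprobe_trace_func(void)
      {
              prepare_uprobe_buffer();  /* the fix: prepare before, not inside */
              /* rcu_read_lock(); */
              __uprobe_trace_func();
              /* rcu_read_unlock(); */
      }

      int main(void)
      {
              uprobe_trace_func();
              return 0;
      }
      ```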
      
      Link: https://lore.kernel.org/all/20240521053017.3708530-1-andrii@kernel.org/
      
      Fixes: 1b8f85de ("uprobes: prepare uprobe args buffer lazily")
      Reported-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      69964673
    • Dongli Zhang's avatar
      genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline · a6c11c0a
      Dongli Zhang authored
      The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
      interrupt affinity reconfiguration via procfs. Instead, the change is
      deferred until the next instance of the interrupt being triggered on the
      original CPU.
      
      When the interrupt next triggers on the original CPU, the new affinity is
      enforced within __irq_move_irq(). A vector is allocated from the new CPU,
      but the old vector on the original CPU remains and is not immediately
      reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
      process is delayed until the next trigger of the interrupt on the new CPU.
      
      Upon the subsequent triggering of the interrupt on the new CPU,
      irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
      remains online. Subsequently, the timer on the old CPU iterates over its
      vector_cleanup list, reclaiming old vectors.
      
      However, a rare scenario arises if the old CPU is outgoing before the
      interrupt triggers again on the new CPU.
      
      In that case irq_force_complete_move() is not invoked on the outgoing CPU
      to reclaim the old apicd->prev_vector because the interrupt isn't currently
      affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
      though __vector_schedule_cleanup() is later called on the new CPU, it
      doesn't reclaim apicd->prev_vector; instead, it simply resets both
      apicd->move_in_progress and apicd->prev_vector to 0.
      
      As a result, the vector remains unreclaimed in vector_matrix, leading to a
      CPU vector leak.
      
      To address this issue, move the invocation of irq_force_complete_move()
      before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
      interrupt is currently, or was previously, affine to the outgoing CPU.
      
      Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
      following a warning message, although theoretically it should never see
      apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.
      
      Fixes: f0383c24 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
      a6c11c0a
    • Fedor Pchelkin's avatar
      dma-mapping: benchmark: handle NUMA_NO_NODE correctly · e64746e7
      Fedor Pchelkin authored
      cpumask_of_node() can be called for NUMA_NO_NODE inside do_map_benchmark()
      resulting in the following sanitizer report:
      
      UBSAN: array-index-out-of-bounds in ./arch/x86/include/asm/topology.h:72:28
      index -1 is out of range for type 'cpumask [64][1]'
      CPU: 1 PID: 990 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #29
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      Call Trace:
       <TASK>
      dump_stack_lvl (lib/dump_stack.c:117)
      ubsan_epilogue (lib/ubsan.c:232)
      __ubsan_handle_out_of_bounds (lib/ubsan.c:429)
      cpumask_of_node (arch/x86/include/asm/topology.h:72) [inline]
      do_map_benchmark (kernel/dma/map_benchmark.c:104)
      map_benchmark_ioctl (kernel/dma/map_benchmark.c:246)
      full_proxy_unlocked_ioctl (fs/debugfs/file.c:333)
      __x64_sys_ioctl (fs/ioctl.c:890)
      do_syscall_64 (arch/x86/entry/common.c:83)
      entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
      Use cpumask_of_node() in place when binding a kernel thread to a cpuset
      of a particular node.
      
      Note that t...
      e64746e7
    • Fedor Pchelkin's avatar
      dma-mapping: benchmark: fix node id validation · 1ff05e72
      Fedor Pchelkin authored
      While validating node ids in map_benchmark_ioctl(), node_possible() may
      be provided with an invalid argument outside the [0, MAX_NUMNODES-1]
      range, leading to:
      
      BUG: KASAN: wild-memory-access in map_benchmark_ioctl (kernel/dma/map_benchmark.c:214)
      Read of size 8 at addr 1fffffff8ccb6398 by task dma_map_benchma/971
      CPU: 7 PID: 971 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #37
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      Call Trace:
       <TASK>
      dump_stack_lvl (lib/dump_stack.c:117)
      kasan_report (mm/kasan/report.c:603)
      kasan_check_range (mm/kasan/generic.c:189)
      variable_test_bit (arch/x86/include/asm/bitops.h:227) [inline]
      arch_test_bit (arch/x86/include/asm/bitops.h:239) [inline]
      _test_bit (include/asm-generic/bitops/instrumented-non-atomic.h:142) [inline]
      node_state (include/linux/nodemask.h:423) [inline]
      map_benchmark_ioctl (kernel/dma/map_benchmark.c:214)
      full_proxy_unlocked_ioctl (fs/debugfs/file.c:333)
      __x64_sys_ioctl (fs/ioctl.c:890)
      do_syscall_64 (arch/x86/entry/common.c:83)
      entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
      Compare node ids with sane bounds first. NUMA_NO_NODE is considered a
      special valid case meaning that benchmarking kthreads won't be bound to a
      cpuset of a given node.
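      A sketch of the bounds check (the MAX_NUMNODES value and the harness
      are assumptions; only the ordering — range check before any bitmap
      lookup such as node_possible() — mirrors the fix):

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>

      #define NUMA_NO_NODE  (-1)
      #define MAX_NUMNODES  64  /* illustrative; configuration-dependent in the kernel */

      /* Validate the user-supplied node id before it is ever used as an
       * index into a node bitmap. */
      static int validate_node_id(int node)
      {
              if (node == NUMA_NO_NODE)       /* valid: kthreads stay unbound */
                      return 0;
              if (node < 0 || node >= MAX_NUMNODES)
                      return -EINVAL;         /* never reaches the bitmap lookup */
              return 0;
      }

      int main(void)
      {
              assert(validate_node_id(NUMA_NO_NODE) == 0);
              assert(validate_node_id(0) == 0);
              assert(validate_node_id(-2) == -EINVAL);
              assert(validate_node_id(MAX_NUMNODES) == -EINVAL);
              printf("ok\n");
              return 0;
      }
      ```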
      
      Found by Linux Verification Center (linuxtesting.org).
      
      Fixes: 65789daa ("dma-mapping: add benchmark support for streaming DMA APIs")
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      1ff05e72