- Jun 27, 2024
-
-
Arnd Bergmann authored
Building with W=1 in some configurations produces a false positive warning for kallsyms: kernel/kallsyms.c: In function '__sprint_symbol.isra': kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict] 503 | strcpy(buffer, name); | ^~~~~~~~~~~~~~~~~~~~ This originally showed up while building with -O3, but later started happening in other configurations as well, depending on inlining decisions. The underlying issue is that the local 'name' variable is always initialized to the be the same as 'buffer' in the called functions that fill the buffer, which gcc notices while inlining, though it could see that the address check always skips the copy. The calling conventions here are rather unusual, as all of the internal lookup functions (bpf_address_lookup, ftrace_mod_address_lookup, ftrace_func_address_lookup, module_address_lookup and kallsyms_lookup_buildid) already use the provided buffer and either return the address of that buffer to indicate success, or NULL for failure, but the callers are written to also expect an arbitrary other buffer to be returned. Rework the calling conventions to return the length of the filled buffer instead of its address, which is simpler and easier to follow as well as avoiding the warning. Leave only the kallsyms_lookup() calling conventions unchanged, since that is called from 16 different functions and adapting this would be a much bigger change. Link: https://lore.kernel.org/lkml/20200107214042.855757-1-arnd@arndb.de/ Link: https://lore.kernel.org/lkml/20240326130647.7bfb1d92@gandalf.local.home/ Tested-by:
Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by:
Luis Chamberlain <mcgrof@kernel.org> Acked-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
- Jun 25, 2024
-
-
Phil Chang authored
The hrtimer function callback must not be NULL. It has to be specified by the call side but it is not validated by the hrtimer code. When a hrtimer is queued without a function callback, the kernel crashes with a null pointer dereference when trying to execute the callback in __run_hrtimer(). Introduce a validation before queuing the hrtimer in hrtimer_start_range_ns(). [anna-maria: Rephrase commit message] Signed-off-by:
Phil Chang <phil.chang@mediatek.com> Signed-off-by:
Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Anna-Maria Behnsen <anna-maria@linutronix.de>
-
Arnd Bergmann authored
Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92f ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6e ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Greg Kroah-Hartman authored
This reverts commit f03e8c10. Let's roll back all of the serial core and printk console changes that went into 6.10-rc1 as there still are problems with them that need to be sorted out. Link: https://lore.kernel.org/r/ZnpRozsdw6zbjqze@tlindgre-MOBL1 Reported-by:
Petr Mladek <pmladek@suse.com> Reported-by:
Tony Lindgren <tony@atomide.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Greg Kroah-Hartman authored
This reverts commit 8a831c58. Let's roll back all of the serial core and printk console changes that went into 6.10-rc1 as there still are problems with them that need to be sorted out. Link: https://lore.kernel.org/r/ZnpRozsdw6zbjqze@tlindgre-MOBL1 Reported-by:
Petr Mladek <pmladek@suse.com> Reported-by:
Tony Lindgren <tony@atomide.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Greg Kroah-Hartman authored
This reverts commit b73c9cbe. Let's roll back all of the serial core and printk console changes that went into 6.10-rc1 as there still are problems with them that need to be sorted out. Link: https://lore.kernel.org/r/ZnpRozsdw6zbjqze@tlindgre-MOBL1 Reported-by:
Petr Mladek <pmladek@suse.com> Reported-by:
Tony Lindgren <tony@atomide.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Jun 24, 2024
-
-
Alexei Starovoitov authored
Zac's syzbot crafted a bpf prog that exposed two bugs in may_goto. The 1st bug is the way may_goto is patched. When offset is negative it should be patched differently. The 2nd bug is in the verifier: when current state may_goto_depth is equal to visited state may_goto_depth it means there is an actual infinite loop. It's not correct to prune exploration of the program at this point. Note, that this check doesn't limit the program to only one may_goto insn, since 2nd and any further may_goto will increment may_goto_depth only in the queued state pushed for future exploration. The current state will have may_goto_depth == 0 regardless of number of may_goto insns and the verifier has to explore the program until bpf_exit. Fixes: 011832b9 ("bpf: Introduce may_goto instruction") Reported-by:
Zac Ecob <zacecob@protonmail.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/bpf/CAADnVQL-15aNp04-cyHRn47Yv61NXfYyhopyZtUyxNojUZUXpA@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20240619235355.85031-1-alexei.starovoitov@gmail.com
-
- Jun 23, 2024
-
-
Huacai Chen authored
After the rework of "Parallel CPU bringup", the cmdline "nosmp" and "maxcpus=0" parameters are not working anymore. These parameters set setup_max_cpus to zero and that's handed to bringup_nonboot_cpus(). The code there does a decrement before checking for zero, which brings it into the negative space and brings up all CPUs. Add a zero check at the beginning of the function to prevent this. [ tglx: Massaged change log ] Fixes: 18415f33 ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") Fixes: 06c6796e ("cpu/hotplug: Fix off by one in cpuhp_bringup_mask()") Signed-off-by:
Huacai Chen <chenhuacai@loongson.cn> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240618081336.3996825-1-chenhuacai@loongson.cn
-
- Jun 21, 2024
-
-
Daniel Borkmann authored
The BPF ring buffer internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters: consumer_pos is the consumer counter to show which logical position the consumer consumed the data, and producer_pos which is the producer counter denoting the amount of data reserved by all producers. Each time a record is reserved, the producer that "owns" the record will successfully advance producer counter. In user space each time a record is read, the consumer of the data advanced the consumer counter once it finished processing. Both counters are stored in separate pages so that from user space, the producer counter is read-only and the consumer counter is read-write. One aspect that simplifies and thus speeds up the implementation of both producers and consumers is how the data area is mapped twice contiguously back-to-back in the virtual memory, allowing to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for book-keeping the length and offset, and is inaccessible to the BPF program. Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ` for the BPF program to use. Bing-Jhong and Muhammad reported that it is however possible to make a second allocated memory chunk overlapping with the first chunk and as a result, the BPF program is now able to edit first chunk's header. For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets allocate a chunk B with size 0x3000. This will succeed because consumer_pos was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask` check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data pages. This means that chunk B at [0x4000,0x4008] is chunk A's header. bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong page and could cause a crash. Fix it by calculating the oldest pending_pos and check whether the range from the oldest outstanding record to the newest would span beyond the ring buffer size. If that is the case, then reject the request. We've tested with the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh) before/after the fix and while it seems a bit slower on some benchmarks, it is still not significantly enough to matter. Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by:
Bing-Jhong Billy Jheng <billy@starlabs.sg> Reported-by:
Muhammad Ramdhan <ramdhan@starlabs.sg> Co-developed-by:
Bing-Jhong Billy Jheng <billy@starlabs.sg> Co-developed-by:
Andrii Nakryiko <andrii@kernel.org> Signed-off-by:
Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
-
Alexei Starovoitov authored
When the following program is processed by the verifier: L1: may_goto L2 goto L1 L2: w0 = 0 exit the may_goto insn is first converted to: L1: r11 = *(u64 *)(r10 -8) if r11 == 0x0 goto L2 r11 -= 1 *(u64 *)(r10 -8) = r11 goto L1 L2: w0 = 0 exit then later as the last step the verifier inserts: *(u64 *)(r10 -8) = BPF_MAX_LOOPS as the first insn of the program to initialize loop count. When the first insn happens to be a branch target of some jmp the bpf_patch_insn_data() logic will produce: L1: *(u64 *)(r10 -8) = BPF_MAX_LOOPS r11 = *(u64 *)(r10 -8) if r11 == 0x0 goto L2 r11 -= 1 *(u64 *)(r10 -8) = r11 goto L1 L2: w0 = 0 exit because instruction patching adjusts all jmps and calls, but for this particular corner case it's incorrect and the L1 label should be one instruction down, like: *(u64 *)(r10 -8) = BPF_MAX_LOOPS L1: r11 = *(u64 *)(r10 -8) if r11 == 0x0 goto L2 r11 -= 1 *(u64 *)(r10 -8) = r11 goto L1 L2: w0 = 0 exit and that's what this patch is fixing. After bpf_patch_insn_data() call adjust_jmp_off() to adjust all jmps that point to newly insert BPF_ST insn to point to insn after. Note that bpf_patch_insn_data() cannot easily be changed to accommodate this logic, since jumps that point before or after a sequence of patched instructions have to be adjusted with the full length of the patch. Conceptually it's somewhat similar to "insert" of instructions between other instructions with weird semantics. Like "insert" before 1st insn would require adjustment of CALL insns to point to newly inserted 1st insn, but not an adjustment JMP insns that point to 1st, yet still adjusting JMP insns that cross over 1st insn (point to insn before or insn after), hence use simple adjust_jmp_off() logic to fix this corner case. Ideally bpf_patch_insn_data() would have an auxiliary info to say where 'the start of newly inserted patch is', but it would be too complex for backport. Fixes: 011832b9 ("bpf: Introduce may_goto instruction") Reported-by:
Zac Ecob <zacecob@protonmail.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/bpf/CAADnVQJ_WWx8w4b=6Gc2EpzAjgv+6A0ridnMz2TvS2egj4r3Gw@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20240619011859.79334-1-alexei.starovoitov@gmail.com
-
- Jun 18, 2024
-
-
Alexei Starovoitov authored
The bpf arena logic didn't account for mremap operation. Add a refcnt for multiple mmap events to prevent use-after-free in arena_vm_close. Fixes: 31746031 ("bpf: Introduce bpf_arena.") Reported-by:
Pengfei Xu <pengfei.xu@intel.com> Signed-off-by:
Alexei Starovoitov <ast@kernel.org> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Barret Rhoden <brho@google.com> Tested-by:
Pengfei Xu <pengfei.xu@intel.com> Closes: https://lore.kernel.org/bpf/Zmuw29IhgyPNKnIM@xpf.sh.intel.com Link: https://lore.kernel.org/bpf/20240617171812.76634-1-alexei.starovoitov@gmail.com
-
- Jun 17, 2024
-
-
Yonghong Song authored
In coerce_subreg_to_size_sx(), for the case where upper sign extension bits are the same for smax32 and smin32 values, we missed to setup properly. This is especially problematic if both smax32 and smin32's sign extension bits are 1. The following is a simple example illustrating the inconsistent verifier states due to missed var_off: 0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar() 1: (bf) r3 = r0 ; R0_w=scalar(id=1) R3_w=scalar(id=1) 2: (57) r3 &= 15 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) 3: (47) r3 |= 128 ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf)) 4: (bc) w7 = (s8)w3 REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f] u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf) The var_off=(0x80, 0xf) is not correct, and the correct one should be var_off=(0xffffff80; 0xf) since from insn 3, we know that at insn 4, the sign extension bits will be 1. This patch fixed this issue by setting var_off properly. Fixes: 8100928c ("bpf: Support new sign-extension mov insns") Signed-off-by:
Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Zac reported a verification failure and Alexei reproduced the issue with a simple reproducer ([1]). The verification failure is due to missed setting for var_off. The following is the reproducer in [1]: 0: R1=ctx() R10=fp0 0: (71) r3 = *(u8 *)(r10 -387) ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0 1: (bc) w7 = (s8)w3 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f)) 2: (36) if w7 >= 0x2533823b goto pc-3 mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3 mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387) 2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f)) 3: (b4) w0 = 0 ; R0_w=0 4: (95) exit Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not correct since upper 24 bits of w7 could be 0 or 1. So correct var_off should be (0x0; 0xffffffff). Missing var_off setting in set_sext32_default_val() caused later incorrect analysis in zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg). To fix the issue, set var_off correctly in set_sext32_default_val(). The correct reg state after insn 1 becomes: 1: (bc) w7 = (s8)w3 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff)) and at insn 2, the verifier correctly determines either branch is possible. [1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/ Fixes: 8100928c ("bpf: Support new sign-extension mov insns") Reported-by:
Zac Ecob <zacecob@protonmail.com> Signed-off-by:
Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Yuntao Wang authored
Commit 4205e478 ("cpu/hotplug: Provide dynamic range for prepare stage") added a dynamic range for the prepare states, but did not handle the assignment of the dynstate variable in __cpuhp_setup_state_cpuslocked(). This causes the corresponding startup callback not to be invoked when calling __cpuhp_setup_state_cpuslocked() with the CPUHP_BP_PREPARE_DYN parameter, even though it should be. Currently, the users of __cpuhp_setup_state_cpuslocked(), for one reason or another, have not triggered this bug. Fixes: 4205e478 ("cpu/hotplug: Provide dynamic range for prepare stage") Signed-off-by:
Yuntao Wang <ytcoode@gmail.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240515134554.427071-1-ytcoode@gmail.com
-
- Jun 15, 2024
-
-
Aleksandr Nogikh authored
In kcov_remote_start()/kcov_remote_stop(), we swap the previous KCOV metadata of the current task into a per-CPU variable. However, the kcov_mode_enabled(mode) check is not sufficient in the case of remote KCOV coverage: current->kcov_mode always remains KCOV_MODE_DISABLED for remote KCOV objects. If the original task that has invoked the KCOV_REMOTE_ENABLE ioctl happens to get interrupted and kcov_remote_start() is called, it ultimately leads to kcov_remote_stop() NOT restoring the original KCOV reference. So when the task exits, all registered remote KCOV handles remain active forever. The most uncomfortable effect (at least for syzkaller) is that the bug prevents the reuse of the same /sys/kernel/debug/kcov descriptor. If we obtain it in the parent process and then e.g. drop some capabilities and continuously fork to execute individual programs, at some point current->kcov of the forked process is lost, kcov_task_exit() takes no action, and all KCOV_REMOTE_ENABLE ioctls calls from subsequent forks fail. And, yes, the efficiency is also affected if we keep on losing remote kcov objects. a) kcov_remote_map keeps on growing forever. b) (If I'm not mistaken), we're also not freeing the memory referenced by kcov->area. Fix it by introducing a special kcov_mode that is assigned to the task that owns a KCOV remote object. It makes kcov_mode_enabled() return true and yet does not trigger coverage collection in __sanitizer_cov_trace_pc() and write_comp_data(). [nogikh@google.com: replace WRITE_ONCE() with an ordinary assignment] Link: https://lkml.kernel.org/r/20240614171221.2837584-1-nogikh@google.com Link: https://lkml.kernel.org/r/20240611133229.527822-1-nogikh@google.com Fixes: 5ff3b30a ("kcov: collect coverage from interrupts") Signed-off-by:
Aleksandr Nogikh <nogikh@google.com> Reviewed-by:
Dmitry Vyukov <dvyukov@google.com> Reviewed-by:
Andrey Konovalov <andreyknvl@gmail.com> Tested-by:
Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Peter Oberparleiter authored
Using gcov on kernels compiled with GCC 14 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 14. Tested with GCC versions 14.1.0 and 13.2.0. Link: https://lkml.kernel.org/r/20240610092743.1609845-1-oberpar@linux.ibm.com Signed-off-by:
Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by:
Allison Henderson <allison.henderson@oracle.com> Reported-by:
Chuck Lever III <chuck.lever@oracle.com> Tested-by:
Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
kernel_wait4() doesn't sleep and returns -EINTR if there is no eligible child and signal_pending() is true. That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() return false and avoid a busy-wait loop. Link: https://lkml.kernel.org/r/20240608120616.GB7947@redhat.com Fixes: 12db8b69 ("entry: Add support for TIF_NOTIFY_SIGNAL") Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Reported-by:
Rachel Menge <rachelmenge@linux.microsoft.com> Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/ Reviewed-by:
Boqun Feng <boqun.feng@gmail.com> Tested-by:
Wei Fu <fuweid89@gmail.com> Reviewed-by:
Jens Axboe <axboe@kernel.dk> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joel Granados <j.granados@samsung.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Zqiang <qiang.zhang1211@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jun 13, 2024
-
-
GUO Zihua authored
A panic happens in ima_match_policy: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 PGD 42f873067 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 5 PID: 1286325 Comm: kubeletmonit.sh Kdump: loaded Tainted: P Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:ima_match_policy+0x84/0x450 Code: 49 89 fc 41 89 cf 31 ed 89 44 24 14 eb 1c 44 39 7b 18 74 26 41 83 ff 05 74 20 48 8b 1b 48 3b 1d f2 b9 f4 00 0f 84 9c 01 00 00 <44> 85 73 10 74 ea 44 8b 6b 14 41 f6 c5 01 75 d4 41 f6 c5 02 74 0f RSP: 0018:ff71570009e07a80 EFLAGS: 00010207 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000200 RDX: ffffffffad8dc7c0 RSI: 0000000024924925 RDI: ff3e27850dea2000 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffabfce739 R10: ff3e27810cc42400 R11: 0000000000000000 R12: ff3e2781825ef970 R13: 00000000ff3e2785 R14: 000000000000000c R15: 0000000000000001 FS: 00007f5195b51740(0000) GS:ff3e278b12d40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000626d24002 CR4: 0000000000361ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ima_get_action+0x22/0x30 process_measurement+0xb0/0x830 ? page_add_file_rmap+0x15/0x170 ? alloc_set_pte+0x269/0x4c0 ? prep_new_page+0x81/0x140 ? simple_xattr_get+0x75/0xa0 ? selinux_file_open+0x9d/0xf0 ima_file_check+0x64/0x90 path_openat+0x571/0x1720 do_filp_open+0x9b/0x110 ? page_counter_try_charge+0x57/0xc0 ? files_cgroup_alloc_fd+0x38/0x60 ? __alloc_fd+0xd4/0x250 ? do_sys_open+0x1bd/0x250 do_sys_open+0x1bd/0x250 do_syscall_64+0x5d/0x1d0 entry_SYSCALL_64_after_hwframe+0x65/0xca Commit c7423dbd ("ima: Handle -ESTALE returned by ima_filter_rule_match()") introduced call to ima_lsm_copy_rule within a RCU read-side critical section which contains kmalloc with GFP_KERNEL. This implies a possible sleep and violates limitations of RCU read-side critical sections on non-PREEMPT systems. Sleeping within RCU read-side critical section might cause synchronize_rcu() returning early and break RCU protection, allowing a UAF to happen. The root cause of this issue could be described as follows: | Thread A | Thread B | | |ima_match_policy | | | rcu_read_lock | |ima_lsm_update_rule | | | synchronize_rcu | | | | kmalloc(GFP_KERNEL)| | | sleep | ==> synchronize_rcu returns early | kfree(entry) | | | | entry = entry->next| ==> UAF happens and entry now becomes NULL (or could be anything). | | entry->action | ==> Accessing entry might cause panic. To fix this issue, we are converting all kmalloc that is called within RCU read-side critical section to use GFP_ATOMIC. Fixes: c7423dbd ("ima: Handle -ESTALE returned by ima_filter_rule_match()") Cc: stable@vger.kernel.org Signed-off-by:
GUO Zihua <guozihua@huawei.com> Acked-by:
John Johansen <john.johansen@canonical.com> Reviewed-by:
Mimi Zohar <zohar@linux.ibm.com> Reviewed-by:
Casey Schaufler <casey@schaufler-ca.com> [PM: fixed missing comment, long lines, !CONFIG_IMA_LSM_RULES case] Signed-off-by:
Paul Moore <paul@paul-moore.com>
-
Maciej Żenczykowski authored
pcpu_hot (defined in arch/x86) is not available on user mode linux (ARCH=um) Cc: Andrii Nakryiko <andrii@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Fixes: 1ae69210 ("bpf: inline bpf_get_smp_processor_id() helper") Signed-off-by:
Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20240613173146.2524647-1-maze@google.com Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
The fake_reg moved into env->fake_reg given it consumes a lot of stack space (120 bytes). Migrate the fake_reg in check_stack_write_fixed_off() as well now that we have it. Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240613115310.25383-2-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Juan reported that after doing some changes to buzzer [0] and implementing a new fuzzing strategy guided by coverage, they noticed the following in one of the probes: [...] 13: (79) r6 = *(u64 *)(r0 +0) ; R0=map_value(ks=4,vs=8) R6_w=scalar() 14: (b7) r0 = 0 ; R0_w=0 15: (b4) w0 = -1 ; R0_w=0xffffffff 16: (74) w0 >>= 1 ; R0_w=0x7fffffff 17: (5c) w6 &= w0 ; R0_w=0x7fffffff R6_w=scalar(smin=smin32=0,smax=umax=umax32=0x7fffffff,var_off=(0x0; 0x7fffffff)) 18: (44) w6 |= 2 ; R6_w=scalar(smin=umin=smin32=umin32=2,smax=umax=umax32=0x7fffffff,var_off=(0x2; 0x7ffffffd)) 19: (56) if w6 != 0x7ffffffd goto pc+1 REG INVARIANTS VIOLATION (true_reg2): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0) REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0) REG INVARIANTS VIOLATION (false_reg2): const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] u32=[0x0, 0xffffffff] s32=[0x80000000, 0x7fffffff] var_off=(0x7fffffff, 0x0) 19: R6_w=0x7fffffff 20: (95) exit from 19 to 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm 21: (14) w6 -= 2147483632 ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=14,var_off=(0x2; 0xfffffffd)) 22: (76) if w6 s>= 0xe goto pc+1 ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=13,var_off=(0x2; 0xfffffffd)) 23: (95) exit from 22 to 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm 24: (14) w6 -= 14 ; R6_w=0 [...] What can be seen here is a register invariant violation on line 19. After the binary-or in line 18, the verifier knows that bit 2 is set but knows nothing about the rest of the content which was loaded from a map value, meaning, range is [2,0x7fffffff] with var_off=(0x2; 0x7ffffffd). When in line 19 the verifier analyzes the branch, it splits the register states in reg_set_min_max() into the registers of the true branch (true_reg1, true_reg2) and the registers of the false branch (false_reg1, false_reg2). Since the test is w6 != 0x7ffffffd, the src_reg is a known constant. Internally, the verifier creates a "fake" register initialized as scalar to the value of 0x7ffffffd, and then passes it onto reg_set_min_max(). Now, for line 19, it is mathematically impossible to take the false branch of this program, yet the verifier analyzes it. It is impossible because the second bit of r6 will be set due to the prior or operation and the constant in the condition has that bit unset (hex(fd) == binary(1111 1101). When the verifier first analyzes the false / fall-through branch, it will compute an intersection between the var_off of r6 and of the constant. This is because the verifier creates a "fake" register initialized to the value of the constant. The intersection result later refines both registers in regs_refine_cond_op(): [...] t = tnum_intersect(tnum_subreg(reg1->var_off), tnum_subreg(reg2->var_off)); reg1->var_off = tnum_with_subreg(reg1->var_off, t); reg2->var_off = tnum_with_subreg(reg2->var_off, t); [...] Since the verifier is analyzing the false branch of the conditional jump, reg1 is equal to false_reg1 and reg2 is equal to false_reg2, i.e. the reg2 is the "fake" register that was meant to hold a constant value. The resulting var_off of the intersection says that both registers now hold a known value of var_off=(0x7fffffff, 0x0) or in other words: this operation manages to make the verifier think that the "constant" value that was passed in the jump operation now holds a different value. Normally this would not be an issue since it should not influence the true branch, however, false_reg2 and true_reg2 are pointers to the same "fake" register. Meaning, the false branch can influence the results of the true branch. In line 24, the verifier assumes R6_w=0, but the actual runtime value in this case is 1. The fix is simply not passing in the same "fake" register location as inputs to reg_set_min_max(), but instead making a copy. Moving the fake_reg into the env also reduces stack consumption by 120 bytes. With this, the verifier successfully rejects invalid accesses from the test program. [0] https://github.com/google/buzzer Fixes: 67420501 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons") Reported-by:
Juan José López Jaimez <jjlopezjaimez@google.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20240613115310.25383-1-daniel@iogearbox.net Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- Jun 11, 2024
-
-
Masami Hiramatsu (Google) authored
The kprobes and synth event generation test modules add events and lock (get a reference) those event file reference in module init function, and unlock and delete it in module exit function. This is because those are designed for playing as modules. If we make those modules as built-in, those events are left locked in the kernel, and never be removed. This causes kprobe event self-test failure as below. [ 97.349708] ------------[ cut here ]------------ [ 97.353453] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:2133 kprobe_trace_self_tests_init+0x3f1/0x480 [ 97.357106] Modules linked in: [ 97.358488] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-g699646734ab5-dirty #14 [ 97.361556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [ 97.363880] RIP: 0010:kprobe_trace_self_tests_init+0x3f1/0x480 [ 97.365538] Code: a8 24 08 82 e9 ae fd ff ff 90 0f 0b 90 48 c7 c7 e5 aa 0b 82 e9 ee fc ff ff 90 0f 0b 90 48 c7 c7 2d 61 06 82 e9 8e fd ff ff 90 <0f> 0b 90 48 c7 c7 33 0b 0c 82 89 c6 e8 6e 03 1f ff 41 ff c7 e9 90 [ 97.370429] RSP: 0000:ffffc90000013b50 EFLAGS: 00010286 [ 97.371852] RAX: 00000000fffffff0 RBX: ffff888005919c00 RCX: 0000000000000000 [ 97.373829] RDX: ffff888003f40000 RSI: ffffffff8236a598 RDI: ffff888003f40a68 [ 97.375715] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 97.377675] R10: ffffffff811c9ae5 R11: ffffffff8120c4e0 R12: 0000000000000000 [ 97.379591] R13: 0000000000000001 R14: 0000000000000015 R15: 0000000000000000 [ 97.381536] FS: 0000000000000000(0000) GS:ffff88807dcc0000(0000) knlGS:0000000000000000 [ 97.383813] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 97.385449] CR2: 0000000000000000 CR3: 0000000002244000 CR4: 00000000000006b0 [ 97.387347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 97.389277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 97.391196] Call Trace: [ 97.391967] <TASK> [ 97.392647] ? __warn+0xcc/0x180 [ 97.393640] ? kprobe_trace_self_tests_init+0x3f1/0x480 [ 97.395181] ? report_bug+0xbd/0x150 [ 97.396234] ? handle_bug+0x3e/0x60 [ 97.397311] ? exc_invalid_op+0x1a/0x50 [ 97.398434] ? asm_exc_invalid_op+0x1a/0x20 [ 97.399652] ? trace_kprobe_is_busy+0x20/0x20 [ 97.400904] ? tracing_reset_all_online_cpus+0x15/0x90 [ 97.402304] ? kprobe_trace_self_tests_init+0x3f1/0x480 [ 97.403773] ? init_kprobe_trace+0x50/0x50 [ 97.404972] do_one_initcall+0x112/0x240 [ 97.406113] do_initcall_level+0x95/0xb0 [ 97.407286] ? kernel_init+0x1a/0x1a0 [ 97.408401] do_initcalls+0x3f/0x70 [ 97.409452] kernel_init_freeable+0x16f/0x1e0 [ 97.410662] ? rest_init+0x1f0/0x1f0 [ 97.411738] kernel_init+0x1a/0x1a0 [ 97.412788] ret_from_fork+0x39/0x50 [ 97.413817] ? rest_init+0x1f0/0x1f0 [ 97.414844] ret_from_fork_asm+0x11/0x20 [ 97.416285] </TASK> [ 97.417134] irq event stamp: 13437323 [ 97.418376] hardirqs last enabled at (13437337): [<ffffffff8110bc0c>] console_unlock+0x11c/0x150 [ 97.421285] hardirqs last disabled at (13437370): [<ffffffff8110bbf1>] console_unlock+0x101/0x150 [ 97.423838] softirqs last enabled at (13437366): [<ffffffff8108e17f>] handle_softirqs+0x23f/0x2a0 [ 97.426450] softirqs last disabled at (13437393): [<ffffffff8108e346>] __irq_exit_rcu+0x66/0xd0 [ 97.428850] ---[ end trace 0000000000000000 ]--- And also, since we can not cleanup dynamic_event file, ftracetest are failed too. To avoid these issues, build these tests only as modules. Link: https://lore.kernel.org/all/171811263754.85078.5877446624311852525.stgit@devnote2/ Fixes: 9fe41efa ("tracing: Add synth event generation test module") Fixes: 64836248 ("tracing: Add kprobe event command generation test module") Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by:
Steven Rostedt (Google) <rostedt@goodmis.org>
-
- Jun 10, 2024
-
-
Oleg Nesterov authored
After the recent commit 5097cbcb ("sched/isolation: Prevent boot crash when the boot CPU is nohz_full") the kernel no longer crashes, but there is another problem. In this case tick_setup_device() calls tick_take_do_timer_from_boot() to update tick_do_timer_cpu and this triggers the WARN_ON_ONCE(irqs_disabled) in smp_call_function_single(). Kill tick_take_do_timer_from_boot() and just use WRITE_ONCE(), the new comment explains why this is safe (thanks Thomas!). Fixes: 08ae95f4 ("nohz_full: Allow the boot CPU to be nohz_full") Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240528122019.GA28794@redhat.com Link: https://lore.kernel.org/all/20240522151742.GA10400@redhat.com
-
- Jun 05, 2024
-
-
Haifeng Xu authored
In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this: [346278.191038] __schedule+0x2d8/0x890 [346278.191046] schedule+0x4e/0xb0 [346278.191049] perf_event_free_task+0x220/0x270 [346278.191056] ? init_wait_var_entry+0x50/0x50 [346278.191060] copy_process+0x663/0x18d0 [346278.191068] kernel_clone+0x9d/0x3d0 [346278.191072] __do_sys_clone+0x5d/0x80 [346278.191076] __x64_sys_clone+0x25/0x30 [346278.191079] do_syscall_64+0x5c/0xc0 [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50 [346278.191086] ? do_syscall_64+0x69/0xc0 [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20 [346278.191092] ? irqentry_exit+0x19/0x30 [346278.191095] ? exc_page_fault+0x89/0x160 [346278.191097] ? asm_exc_page_fault+0x8/0x30 [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae The task was waiting for the refcount become to 1, but from the vmcore, we found the refcount has already b...
-
- Jun 03, 2024
-
-
Cong Wang authored
After commit 1a80dbcb, bpf_link can be freed by link->ops->dealloc_deferred, but the code still tests and uses link->ops->dealloc afterward, which leads to a use-after-free as reported by syzbot. Actually, one of them should be sufficient, so just call one of them instead of both. Also add a WARN_ON() in case of any problematic implementation. Fixes: 1a80dbcb ("bpf: support deferring bpf_link dealloc to after RCU grace period") Reported-by:
<syzbot+1989ee16d94720836244@syzkaller.appspotmail.com> Signed-off-by:
Cong Wang <cong.wang@bytedance.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Acked-by:
Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240602182703.207276-1-xiyou.wangcong@gmail.com
-
Thorsten Blum authored
The iterator variable dst cannot be NULL and the if check can be removed. Remove it and fix the following Coccinelle/coccicheck warning reported by itnull.cocci: ERROR: iterator variable bound on line 762 cannot be NULL Signed-off-by:
Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Toke Høiland-Jørgensen <toke@redhat.com> Acked-by:
Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240529101900.103913-2-thorsten.blum@toblux.com
-
- May 31, 2024
-
-
Jiri Olsa authored
The bpf_session_cookie is unavailable for !CONFIG_FPROBE as reported by Sebastian [1]. To fix that we remove CONFIG_FPROBE ifdef for session kfuncs, which is fine, because there's filter for session programs. Then based on bpf_trace.o dependency: obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o we add bpf_session_cookie BTF_ID in special_kfunc_set list dependency on CONFIG_BPF_EVENTS. [1] https://lore.kernel.org/bpf/20240531071557.MvfIqkn7@linutronix.de/T/#m71c6d5ec71db2967288cb79acedc15cc5dbfeec5 Reported-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by:
Alexei Starovoitov <ast@kernel.org> Fixes: 5c919ace ("bpf: Add support for kprobe session cookie") Signed-off-by:
Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240531194500.2967187-1-jolsa@kernel.org Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- May 29, 2024
-
-
Miguel Ojeda authored
Commit 13e1df09 ("kheaders: explicitly validate existence of cpio command") added an explicit check for `cpio` using `type`. However, `type` in `dash` (which is used in some popular distributions and base images as the shell script runner) prints the missing message to standard output, and thus no error is printed: $ bash -c 'type missing >/dev/null' bash: line 1: type: missing: not found $ dash -c 'type missing >/dev/null' $ For instance, this issue may be seen by loongarch builders, given its defconfig enables CONFIG_IKHEADERS since commit 9cc1df42 ("LoongArch: Update Loongson-3 default config file"). Therefore, use `command -v` instead to have consistent behavior, and take the chance to provide a more explicit error. Fixes: 13e1df09 ("kheaders: explicitly validate existence of cpio command") Signed-off-by:
Miguel Ojeda <ojeda@kernel.org> Signed-off-by:
Masahiro Yamada <masahiroy@kernel.org>
-
Matthias Maennich authored
Build environments might be running with different umask settings resulting in indeterministic file modes for the files contained in kheaders.tar.xz. The file itself is served with 444, i.e. world readable. Archive the files explicitly with 744,a+X to improve reproducibility across build environments. --mode=0444 is not suitable as directories need to be executable. Also, 444 makes it hard to delete all the readonly files after extraction. Cc: stable@vger.kernel.org Signed-off-by:
Matthias Maennich <maennich@google.com> Signed-off-by:
Masahiro Yamada <masahiroy@kernel.org>
-
- May 27, 2024
-
-
Jakub Sitnicki authored
We have seen an influx of syzkaller reports where a BPF program attached to a tracepoint triggers a locking rule violation by performing a map_delete on a sockmap/sockhash. We don't intend to support this artificial use scenario. Extend the existing verifier allowed-program-type check for updating sockmap/sockhash to also cover deleting from a map. From now on only BPF programs which were previously allowed to update sockmap/sockhash can delete from these map types. Fixes: ff910599 ("bpf, sockmap: Prevent lock inversion deadlock in map delete elem") Reported-by:
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by:
<syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com> Signed-off-by:
Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Tested-by:
<syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com> Acked-by:
John Fastabend <john.fastabend@gmail.com> Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172 Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-1-944b372f2101@cloudflare.com
-
Carlos López authored
btf_find_struct_member() might return NULL or an error via the ERR_PTR() macro. However, its caller in parse_btf_field() only checks for the NULL condition. Fix this by using IS_ERR() and returning the error up the stack. Link: https://lore.kernel.org/all/20240527094351.15687-1-clopez@suse.de/ Fixes: c440adfb ("tracing/probes: Support BTF based data structure field access") Signed-off-by:
Carlos López <clopez@suse.de> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org>
-
- May 25, 2024
-
-
Andrii Nakryiko authored
get_pid_task() internally already calls rcu_read_lock() and rcu_read_unlock(), so there is no point to do this one extra time. This is a drive-by improvement and has no correctness implications. Acked-by:
Jiri Olsa <jolsa@kernel.org> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-3-andrii@kernel.org Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Current implementation of PID filtering logic for multi-uprobes in uprobe_prog_run() is filtering down to exact *thread*, while the intent for PID filtering it to filter by *process* instead. The check in uprobe_prog_run() also differs from the analogous one in uprobe_multi_link_filter() for some reason. The latter is correct, checking task->mm, not the task itself. Fix the check in uprobe_prog_run() to perform the same task->mm check. While doing this, we also update get_pid_task() use to use PIDTYPE_TGID type of lookup, given the intent is to get a representative task of an entire process. This doesn't change behavior, but seems more logical. It would hold task group leader task now, not any random thread task. Last but not least, given multi-uprobe support is half-broken due to this PID filtering logic (depending on whether PID filtering is important or not), we need to make it easy for user space consumers (including libbpf) to easily detect whether PID filtering logic was already fixed. We do it here by adding an early check on passed pid parameter. If it's negative (and so has no chance of being a valid PID), we return -EINVAL. Previous behavior would eventually return -ESRCH ("No process found"), given there can't be any process with negative PID. This subtle change won't make any practical change in behavior, but will allow applications to detect PID filtering fixes easily. Libbpf fixes take advantage of this in the next patch. Cc: stable@vger.kernel.org Acked-by:
Jiri Olsa <jolsa@kernel.org> Fixes: b733eead ("bpf: Add pid filter support for uprobe_multi link") Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-2-andrii@kernel.org Signed-off-by:
Alexei Starovoitov <ast@kernel.org>
-
- May 24, 2024
-
-
Christian Brauner authored
Otherwise we can cause spurious EBUSY issues when trying to mount the rootfs later on. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218845 Reported-by:
Petri Kaukasoina <petri.kaukasoina@tuni.fi> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
dicken.ding authored
irq_find_at_or_after() dereferences the interrupt descriptor which is returned by mt_find() while neither holding sparse_irq_lock nor RCU read lock, which means the descriptor can be freed between mt_find() and the dereference: CPU0 CPU1 desc = mt_find() delayed_free_desc(desc) irq_desc_get_irq(desc) The use-after-free is reported by KASAN: Call trace: irq_get_next_irq+0x58/0x84 show_stat+0x638/0x824 seq_read_iter+0x158/0x4ec proc_reg_read_iter+0x94/0x12c vfs_read+0x1e0/0x2c8 Freed by task 4471: slab_free_freelist_hook+0x174/0x1e0 __kmem_cache_free+0xa4/0x1dc kfree+0x64/0x128 irq_kobj_release+0x28/0x3c kobject_put+0xcc/0x1e0 delayed_free_desc+0x14/0x2c rcu_do_batch+0x214/0x720 Guard the access with a RCU read lock section. Fixes: 721255b9 ("genirq: Use a maple tree for interrupt descriptor m...
-
- May 23, 2024
-
-
Jeff Xu authored
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management...
-
Andrii Nakryiko authored
Recent changes made uprobe_cpu_buffer preparation lazy, and moved it deeper into __uprobe_trace_func(). This is problematic because __uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock() block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() -> mutex_lock(&ucb->mutex), leading to a splat about using mutex under non-sleepable RCU: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: stress-ng-sigq preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 ... Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 __might_resched+0x24c/0x270 ? prepare_uprobe_buffer+0xd5/0x1d0 __mutex_lock+0x41/0x820 ? ___perf_sw_event+0x206/0x290 ? __perf_event_task_sched_in+0x54/0x660 ? __perf_event_task_sched_in+0x54/0x660 prepare_uprobe_buffer+0xd5/0x1d0 __uprobe_trace_func+0x4a/0x140 uprobe_dispatcher+0x135/0x280 ? uprobe_dispatcher+0x94/0x280 uprobe_notify_resume+0x650/0xec0 ? atomic_notifier_call_chain+0x21/0x110 ? atomic_notifier_call_chain+0xf8/0x110 irqentry_exit_to_user_mode+0xe2/0x1e0 asm_exc_int3+0x35/0x40 RIP: 0033:0x7f7e1d4da390 Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00 <cc> 0f 1e fa b8 27 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e RSP: 002b:00007ffd2abc3608 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000076d325f1 RCX: 0000000000000000 RDX: 0000000076d325f1 RSI: 000000000000000a RDI: 00007ffd2abc3690 RBP: 000000000000000a R08: 00017fb700000000 R09: 00017fb700000000 R10: 00017fb700000000 R11: 0000000000000246 R12: 0000000000017ff2 R13: 00007ffd2abc3610 R14: 0000000000000000 R15: 00007ffd2abc3780 </TASK> Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside of RCU locked section. This still keeps this buffer preparation lazy and helps avoid the overhead when it's not needed. E.g., if there is only BPF uprobe handler installed on a given uprobe, buffer won't be initialized. Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not affected, as it doesn't prepare buffer under RCU read lock. Link: https://lore.kernel.org/all/20240521053017.3708530-1-andrii@kernel.org/ Fixes: 1b8f85de ("uprobes: prepare uprobe args buffer lazily") Reported-by:
Breno Leitao <leitao@debian.org> Signed-off-by:
Andrii Nakryiko <andrii@kernel.org> Signed-off-by:
Masami Hiramatsu (Google) <mhiramat@kernel.org>
-
Dongli Zhang authored
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of interrupt affinity reconfiguration via procfs. Instead, the change is deferred until the next instance of the interrupt being triggered on the original CPU. When the interrupt next triggers on the original CPU, the new affinity is enforced within __irq_move_irq(). A vector is allocated from the new CPU, but the old vector on the original CPU remains and is not immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming process is delayed until the next trigger of the interrupt on the new CPU. Upon the subsequent triggering of the interrupt on the new CPU, irq_complete_move() adds a task to the old CPU's vector_cleanup list if it remains online. Subsequently, the timer on the old CPU iterates over its vector_cleanup list, reclaiming old vectors. However, a rare scenario arises if the old CPU is outgoing before the interrupt triggers again on the new CPU. In that case irq_force_complete_move() is not invoked on the outgoing CPU to reclaim the old apicd->prev_vector because the interrupt isn't currently affine to the outgoing CPU, and irq_needs_fixup() returns false. Even though __vector_schedule_cleanup() is later called on the new CPU, it doesn't reclaim apicd->prev_vector; instead, it simply resets both apicd->move_in_progress and apicd->prev_vector to 0. As a result, the vector remains unreclaimed in vector_matrix, leading to a CPU vector leak. To address this issue, move the invocation of irq_force_complete_move() before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the interrupt is currently or used to be affine to the outgoing CPU. Additionally, reclaim the vector in __vector_schedule_cleanup() as well, following a warning message, although theoretically it should never see apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU. Fixes: f0383c24 ("genirq/cpuhotplug: Add support for cleaning up move in progress") Signed-off-by:
Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
-
Fedor Pchelkin authored
cpumask_of_node() can be called for NUMA_NO_NODE inside do_map_benchmark() resulting in the following sanitizer report: UBSAN: array-index-out-of-bounds in ./arch/x86/include/asm/topology.h:72:28 index -1 is out of range for type 'cpumask [64][1]' CPU: 1 PID: 990 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #29 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:117) ubsan_epilogue (lib/ubsan.c:232) __ubsan_handle_out_of_bounds (lib/ubsan.c:429) cpumask_of_node (arch/x86/include/asm/topology.h:72) [inline] do_map_benchmark (kernel/dma/map_benchmark.c:104) map_benchmark_ioctl (kernel/dma/map_benchmark.c:246) full_proxy_unlocked_ioctl (fs/debugfs/file.c:333) __x64_sys_ioctl (fs/ioctl.c:890) do_syscall_64 (arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Use cpumask_of_node() in place when binding a kernel thread to a cpuset of a particular node. Note that t...
-
Fedor Pchelkin authored
While validating node ids in map_benchmark_ioctl(), node_possible() may be provided with invalid argument outside of [0,MAX_NUMNODES-1] range leading to: BUG: KASAN: wild-memory-access in map_benchmark_ioctl (kernel/dma/map_benchmark.c:214) Read of size 8 at addr 1fffffff8ccb6398 by task dma_map_benchma/971 CPU: 7 PID: 971 Comm: dma_map_benchma Not tainted 6.9.0-rc6 #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:117) kasan_report (mm/kasan/report.c:603) kasan_check_range (mm/kasan/generic.c:189) variable_test_bit (arch/x86/include/asm/bitops.h:227) [inline] arch_test_bit (arch/x86/include/asm/bitops.h:239) [inline] _test_bit at (include/asm-generic/bitops/instrumented-non-atomic.h:142) [inline] node_state (include/linux/nodemask.h:423) [inline] map_benchmark_ioctl (kernel/dma/map_benchmark.c:214) full_proxy_unlocked_ioctl (fs/debugfs/file.c:333) __x64_sys_ioctl (fs/ioctl.c:890) do_syscall_64 (arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Compare node ids with sane bounds first. NUMA_NO_NODE is considered a special valid case meaning that benchmarking kthreads won't be bound to a cpuset of a given node. Found by Linux Verification Center (linuxtesting.org). Fixes: 65789daa ("dma-mapping: add benchmark support for streaming DMA APIs") Signed-off-by:
Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by:
Robin Murphy <robin.murphy@arm.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-