- Jul 26, 2024
-
-
Huang Xiaojia authored
FS_IOC_GETFSSYSFSPATH ioctl exposes /sys/fs path of a given filesystem, potentially standarizing sysfs reporting. This patch add support for FS_IOC_GETFSSYSFSPATH for erofs, "erofs/<dev>" will be outputted for bdev cases, "erofs/[domain_id,]<fs_id>" will be outputted for fscache cases. Signed-off-by:
Huang Xiaojia <huangxiaojia2@huawei.com> Link: https://lore.kernel.org/r/20240720082335.441563-1-huangxiaojia2@huawei.com Reviewed-by:
Chao Yu <chao@kernel.org> Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
Gao Xiang authored
In z_erofs_get_gbuf(), the current task may be migrated to another CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`. Therefore, z_erofs_put_gbuf() will trigger the following issue which was found by stress test: <2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58! .. <4>[772156.435007] <4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G E 6.10.0-rc7+ #2 <4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017 <4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) <4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs] .. <6>[772156.445958] stress (3127): drop_caches: 1 <4>[772156.446120] Call trace: <4>[772156.446121] z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.446761] z_erofs_lz4_decompress+0x600/0x6a0 [erofs] <4>[772156.446897] z_erofs_decompress_queue+0x740/0xa10 [erofs] <4>[772156.447036] z_erofs_runqueue+0x428/0x8c0 [erofs] <4>[772156.447160] z_erofs_readahead+0x224/0x390 [erofs] .. Fixes: f36f3010 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable") Cc: <stable@vger.kernel.org> # 6.10+ Reviewed-by:
Chunhai Guo <guochunhai@vivo.com> Reviewed-by:
Sandeep Dhavale <dhavale@google.com> Reviewed-by:
Chao Yu <chao@kernel.org> Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240722035110.3456740-1-hsiangkao@linux.alibaba.com
-
Hongbo Li authored
Add support for STATX_DIOALIGN to EROFS, so that direct I/O alignment restrictions are exposed to userspace in a generic way. [Before] ``` ./statx_test /mnt/erofs/testfile statx(/mnt/erofs/testfile) = 0 dio mem align:0 dio offset align:0 ``` [After] ``` ./statx_test /mnt/erofs/testfile statx(/mnt/erofs/testfile) = 0 dio mem align:512 dio offset align:512 ``` Signed-off-by:
Hongbo Li <lihongbo22@huawei.com> Reviewed-by:
Chao Yu <chao@kernel.org> Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240718083243.2485437-1-hsiangkao@linux.alibaba.com
-
- Jul 13, 2024
-
-
Dan Carpenter authored
Smatch complains that: fs/erofs/zdata.c:1047 z_erofs_scan_folio() error: uninitialized symbol 'err'. The issue is if we hit this (!(map->m_flags & EROFS_MAP_MAPPED)) { condition then "err" isn't set. It's inside a loop so we would have to hit that condition on every iteration. Initialize "err" to zero to solve this. Fixes: 5b9654ef ("erofs: teach z_erofs_scan_folios() to handle multi-page folios") Signed-off-by:
Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/f78ab50e-ed6d-4275-8dd4-a4159fa565a2@stanley.mountain Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
- Jul 11, 2024
-
-
Gao Xiang authored
LZ4 always reuses the decompressed buffer as its LZ77 sliding window (dynamic dictionary) for optimal performance. However, in specific cases, the output buffer may not fully contain valid page cache pages, resulting in the use of short-lived pages for temporary purposes. Due to the limited sliding window size, LZ4 shortlived bounce pages can also be reused in a sliding manner, so each bounce page can be vmapped multiple times in different relative positions by design. In order to avoiding double frees, currently, reuse counts are recorded via page refcount, but it will no longer be used as-is in the future world of Memdescs. Just maintain a lookup table to check if a shortlived page is reused. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240711053659.1364989-1-hsiangkao@linux.alibaba.com
-
- Jul 10, 2024
-
-
Hongzhen Luo authored
Consolidate them under erofs_map_blocks_* for simplicity since we have many other ways to know if a given inode is compressed or not. Signed-off-by:
Hongzhen Luo <hongzhen@linux.alibaba.com> Reviewed-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240710083459.208362-1-hongzhen@linux.alibaba.com Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
- Jul 09, 2024
-
-
Gao Xiang authored
Just use a generic helper to prepare buffers for all supported stream decompressors, eliminating similar logic. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240709094106.3018109-3-hsiangkao@linux.alibaba.com
-
Gao Xiang authored
Introduce z_erofs_{init,exit}_decompressor() to unexport z_erofs_{deflate,lzma,zstd}_{init,exit}(). Besides, call them in z_erofs_{init,exit}_subsystem() for simplicity. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240709094106.3018109-2-hsiangkao@linux.alibaba.com
-
Gao Xiang authored
Thus *_config() function declarations can be avoided. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240709094106.3018109-1-hsiangkao@linux.alibaba.com
-
- Jul 08, 2024
-
-
Gao Xiang authored
After revisiting the design, I believe `struct z_erofs_bvec` should be page-based instead of folio-based due to the reasons below: - The minimized memory mapping block is a page; - Under the certain circumstances, only temporary pages needs to be used instead of folios since refcount, mapcount for such pages are unnecessary; - Decompressors handle all types of pages including temporary pages, not only folios. When handling `struct z_erofs_bvec`, all folio-related information is now accessed using the page_folio() helper. The final goal of this round adaptation is to eliminate direct accesses to `struct page` in the EROFS codebase, except for some exceptions like `z_erofs_is_shortlived_page()` and `z_erofs_page_is_invalidated()`, which require a new helper to determine the memdesc type of an arbitrary page. Actually large folios of compressed files seem to work now, yet I tend to conduct more tests before officially enabling this for all scenarios. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240703120051.3653452-4-hsiangkao@linux.alibaba.com
-
Gao Xiang authored
Previously, a folio just contains one page. In order to enable large folios, z_erofs_scan_folios() needs to handle multi-page folios. First, this patch eliminates all gotos. Instead, the new loop deal with multiple parts in each folio. It's simple to handle the parts which belong to unmapped extents or fragment extents; but for encoded extents, the page boundaries needs to be considered for `tight` and `split` to keep inplace I/Os work correctly: when a part crosses the page boundary, they needs to be reseted properly. Besides, simplify `tight` derivation since Z_EROFS_PCLUSTER_HOOKED has been removed for quite a while. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240703120051.3653452-3-hsiangkao@linux.alibaba.com
-
Gao Xiang authored
Just a straight-forward conversion. No logic changes. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240703120051.3653452-2-hsiangkao@linux.alibaba.com
-
Gao Xiang authored
Unlike `pagecache_get_page()`, `__filemap_get_folio()` returns error pointers instead of NULL, thus switching to `IS_ERR_OR_NULL`. Apart from that, it's just a straightforward conversion. Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240703120051.3653452-1-hsiangkao@linux.alibaba.com
-
- Jul 03, 2024
-
-
Boris Burkov authored
Another improper use of __folio_put() in an error path after freshly allocating pages/folios which returns them with the refcount initialized to 1. The refactor from __free_pages() -> __folio_put() (instead of folio_put) removed a refcount decrement found in __free_pages() and folio_put but absent from __folio_put(). Fixes: 13df3775 ("btrfs: cleanup metadata page pointer usage") CC: stable@vger.kernel.org # 6.8+ Tested-by:
Ed Tomlinson <edtoml@gmail.com> Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
Boris Burkov <boris@bur.io> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Boris Burkov authored
The conversion to folios switched __free_page() to __folio_put() in the error path in btrfs_do_encoded_write(). However, this gets the page refcounting wrong. If we do hit that error path (I reproduced by modifying btrfs_do_encoded_write to pretend to always fail in a way that jumps to out_folios and running the fstests case btrfs/281), then we always hit the following BUG freeing the folio: BUG: Bad page state in process btrfs pfn:40ab0b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 bad_page+0xea/0xf0 free_unref_page+0x8e1/0x900 ? __mem_cgroup_uncharge+0x69/0x90 __folio_put+0xe6/0x190 btrfs_do_encoded_write+0x445/0x780 ? current_time+0x25/0xd0 btrfs_do_write_iter+0x2cc/0x4b0 btrfs_ioctl_encoded_write+0x2b6/0x340 It turns out __free_page() decreases the page reference count while __folio_put() does not. Switch __folio_put() to folio_put() which decreases the folio reference count first. Fixes: 400b172b ("btrfs: compression: migrate compression/decompression paths to folios") Tested-by:
Ed Tomlinson <edtoml@gmail.com> Reviewed-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
Boris Burkov <boris@bur.io> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Ryusuke Konishi authored
If the bitmap block that manages the inode allocation status is corrupted, nilfs_ifile_create_inode() may allocate a new inode from the reserved inode area where it should not be allocated. Previous fix commit d325dc6e ("nilfs2: fix use-after-free bug of struct nilfs_root"), fixed the problem that reserved inodes with inode numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to bitmap corruption, but since the start number of non-reserved inodes is read from the super block and may change, in which case inode allocation may occur from the extended reserved inode area. If that happens, access to that inode will cause an IO error, causing the file system to degrade to an error state. Fix this potential issue by adding a wraparound option to the common metadata object allocation routine and by modifying nilfs_ifile_create_inode() to disable the option so that it only allocates inodes with inode numbers greater than or equal to the inode number read in "nilfs->ns_first_ino", regardless of the bitmap status of reserved inodes. Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Ryusuke Konishi authored
Syzbot reported that mounting and unmounting a specific pattern of corrupted nilfs2 filesystem images causes a use-after-free of metadata file inodes, which triggers a kernel bug in lru_add_fn(). As Jan Kara pointed out, this is because the link count of a metadata file gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(), tries to delete that inode (ifile inode in this case). The inconsistency occurs because directories containing the inode numbers of these metadata files that should not be visible in the namespace are read without checking. Fix this issue by treating the inode numbers of these internal files as errors in the sanity check helper when reading directory folios/pages. Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer analysis. Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by:
<syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8 Reported-by:
Jan Kara <jack@suse.cz> Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3 Tested-by:
Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
Ryusuke Konishi authored
Patch series "nilfs2: fix potential issues related to reserved inodes". This series fixes one use-after-free issue reported by syzbot, caused by nilfs2's internal inode being exposed in the namespace on a corrupted filesystem, and a couple of flaws that cause problems if the starting number of non-reserved inodes written in the on-disk super block is intentionally (or corruptly) changed from its default value. This patch (of 3): In the current implementation of nilfs2, "nilfs->ns_first_ino", which gives the first non-reserved inode number, is read from the superblock, but its lower limit is not checked. As a result, if a number that overlaps with the inode number range of reserved inodes such as the root directory or metadata files is set in the super block parameter, the inode number test macros (NILFS_MDT_INODE and NILFS_VALID_INODE) will not function properly. In addition, these test macros use left bit-shift calculations using with the inode number as the shift count via the BIT macro, but the result of a shift calculation that exceeds the bit width of an integer is undefined in the C specification, so if "ns_first_ino" is set to a large value other than the default value NILFS_USER_INO (=11), the macros may potentially malfunction depending on the environment. Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and by preventing bit shifts equal to or greater than the NILFS_USER_INO constant in the inode number test macros. Also, change the type of "ns_first_ino" from signed integer to unsigned integer to avoid the need for type casting in comparisons such as the lower bound check introduced this time. Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 02, 2024
-
-
David Howells authored
cifs_expand_read() is causing a performance regression of around 30% by causing extra pagecache to be allocated for an inode in the readahead path before we begin actually dispatching RPC requests, thereby delaying the actual I/O. The expansion is sized according to the rsize parameter, which seems to be 4MiB on my test system; this is a big step up from the first requests made by the fio test program. Simple repro (look at read bandwidth number): fio --name=writetest --filename=/xfstest.test/foo --time_based --runtime=60 --size=16M --numjobs=1 --rw=read Fix this by removing cifs_expand_readahead(). Readahead expansion is mostly useful for when we're using the local cache if the local cache has a block size greater than PAGE_SIZE, so we can dispense with it when not caching. Fixes: 69c3c023 ("cifs: Implement netfslib hooks") Signed-off-by:
David Howells <dhowells@redhat.com> Acked-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by:
Steve French <stfrench@microsoft.com>
-
Chen Ni authored
Replace a comma between expression statements by a semicolon. Signed-off-by:
Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn/ Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn Acked-by:
David Howells <dhowells@redhat.com> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Christian Brauner authored
Jan reported that 'cd ..' may take a long time in deep directory hierarchies under a bind-mount. If concurrent renames happen it is possible to livelock in is_subdir() because it will keep retrying. Change is_subdir() from simply retrying over and over to retry once and then acquire the rename lock to handle deep ancestor chains better. The list of alternatives to this approach were less then pleasant. Change the scope of rcu lock to cover the whole walk while at it. A big thanks to Jan and Linus. Both Jan and Linus had proposed effectively the same thing just that one version ended up being slightly more elegant. Reported-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Jann Horn authored
When fcntl_setlk() races with close(), it removes the created lock with do_lock_file_wait(). However, LSMs can allow the first do_lock_file_wait() that created the lock while denying the second do_lock_file_wait() that tries to remove the lock. In theory (but AFAIK not in practice), posix_lock_file() could also fail to remove a lock due to GFP_KERNEL allocation failure (when splitting a range in the middle). After the bug has been triggered, use-after-free reads will occur in lock_get_status() when userspace reads /proc/locks. This can likely be used to read arbitrary kernel memory, but can't corrupt kernel memory. This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in enforcing mode and only works from some security contexts. Fix it by calling locks_remove_posix() instead, which is designed to reliably get rid of POSIX locks associated with the given file and files_struct and is also used by filp_flush(). Fixes: c293621b ("[PATCH] stale POSIX lock handling") Cc: stable@kernel.org Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563 Signed-off-by:
Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Filipe Manana authored
In the ref-verify tool, when processing the inline references of an extent item, we may end up returning with uninitialized return value, because: 1) The 'ret' variable is not initialized if there are no inline extent references ('ptr' == 'end' before the while loop starts); 2) If we find an extent owner inline reference we don't initialize 'ret'. So fix these cases by initializing 'ret' to 0 when declaring the variable and set it to -EINVAL if we find an extent owner inline references and simple quotas are not enabled (as well as print an error message). Reported-by:
Mirsad Todorovac <mtodorovac69@gmail.com> Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/ Signed-off-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] Syzbot reports the following regression detected by KASAN: BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171 CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922 btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382 create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875 btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340 btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422 btrfs_ioctl+0x99e/0xc60 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcbf1992509 RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509 RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660 R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e </TASK> And it also pinned it down to commit b5357cb2 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled"). [CAUSE] That offending commit skips the whole qgroup inherit check if qgroup is not enabled. But that also skips the very basic checks like num_ref_copies/num_excl_copies and the structure size checks. Meaning if a qgroup enable/disable race is happening at the background, and we pass a btrfs_qgroup_inherit structure when the qgroup is disabled, the check would be completely skipped. Then at the time of transaction commitment, qgroup is re-enabled and btrfs_qgroup_inherit() is going to use the incorrect structure and causing the above KASAN error. [FIX] Make btrfs_qgroup_check_inherit() only skip the source qgroup checks. So that even if invalid btrfs_qgroup_inherit structure is passed in, we can still reject invalid ones no matter if qgroup is enabled or not. Furthermore we do already have an extra safety inside btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources, so even if we only skip the qgroup source check we're still safe. Reported-by:
<syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com> Fixes: b5357cb2 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled") Reviewed-by:
Boris Burkov <boris@bur.io> Reviewed-by:
Jeongjun Park <aha310510@gmail.com> Signed-off-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Naohiro Aota authored
calc_available_free_space() returns the total size of metadata (or system) block groups, which can be allocated from unallocated disk space. The logic is wrong on zoned mode in two places. First, the calculation of data_chunk_size is wrong. We always allocate one zone as one chunk, and no partial allocation of a zone. So, we should use zone_size (= data_sinfo->chunk_size) as it is. Second, the result "avail" may not be zone aligned. Since we always allocate one zone as one chunk on zoned mode, returning non-zone size aligned bytes will result in less pressure on the async metadata reclaim process. This is serious for the nearly full state with a large zone size device. Allowing over-commit too much will result in less async reclaim work and end up in ENOSPC. We can align down to the zone size to avoid that. Fixes: cb6cbab7 ("btrfs: adjust overcommit logic when very close to full") CC: stable@vger.kernel.org # 6.9 Signed-off-by:
Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by:
Boris Burkov <boris@bur.io> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- Jul 01, 2024
-
-
Naohiro Aota authored
There is a potential parallel list adding for retrying in btrfs_reclaim_bgs_work and adding to the unused list. Since the block group is removed from the reclaim list and it is on a relocation work, it can be added into the unused list in parallel. When that happens, adding it to the reclaim list will corrupt the list head and trigger list corruption like below. Fix it by taking fs_info->unused_bgs_lock. [177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104 [177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0) [177.529][T2585409] ------------[ cut here ]------------ [177.537][T2585409] kernel BUG at lib/list_debug.c:65! [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1 [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs] [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72 [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286 [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000 [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40 [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08 [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0 [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000 [177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000 [177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0 [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000 [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400 [177.742][T2585409] PKRU: 55555554 [177.748][T2585409] Call Trace: [177.753][T2585409] <TASK> [177.759][T2585409] ? __die_body.cold+0x19/0x27 [177.766][T2585409] ? die+0x2e/0x50 [177.772][T2585409] ? do_trap+0x1ea/0x2d0 [177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.788][T2585409] ? do_error_trap+0xa3/0x160 [177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.805][T2585409] ? handle_invalid_op+0x2c/0x40 [177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.820][T2585409] ? exc_invalid_op+0x2d/0x40 [177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20 [177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72 [177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs] There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is safe, AFAICS. Since the block group was in the unused list, the used bytes should be 0 when it was added to the unused list. Then, it checks block_group->{used,reserved,pinned} are still 0 under the block_group->lock. So, they should be still eligible for the unused list, not the reclaim list. The reason it is safe there it's because because we're holding space_info->groups_sem in write mode. That means no other task can allocate from the block group, so while we are at deleted_unused_bgs() it's not possible for other tasks to allocate and deallocate extents from the block group, so it can't be added to the unused list or the reclaim list by anyone else. The bug can be reproduced by btrfs/166 after a few rounds. In practice this can be hit when relocation cannot find more chunk space and ends with ENOSPC. Reported-by:
Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by:
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> Fixes: 4eb4e85c ("btrfs: retry block group reclaim without infinite loop") CC: stable@vger.kernel.org # 5.15+ Reviewed-by:
Filipe Manana <fdmanana@suse.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Qu Wenruo <wqu@suse.com> Signed-off-by:
Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
- Jun 29, 2024
-
-
Gao Xiang authored
Sometimes, the on-disk metadata might be invalid due to user interrupts, storage failures, or other unknown causes. In that case, z_erofs_map_blocks_iter() may still return a valid m_llen while other fields remain invalid (e.g., m_plen can be 0). Due to the return value of z_erofs_scan_folio() in some path will be ignored on purpose, the following z_erofs_scan_folio() could then use the invalid value by accident. Let's reset m_llen to 0 to prevent this. Link: https://lore.kernel.org/r/20240629185743.2819229-1-hsiangkao@linux.alibaba.com Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
- Jun 26, 2024
-
-
David Howells authored
Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not filemap_fdatawait_range() to flush conflicting data. Fixes: 102a7e2c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/614300.1719228243@warthog.procyon.org.uk cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has taken the folio lock (as filemap_page_mkwrite() does). Without this, generic/247 occasionally oopses with something like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0 ... Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x6e/0xa0 ? exc_page_fault+0xc2/0xe0 ? asm_exc_page_fault+0x22/0x30 ? trace_event_raw_event_netfs_folio+0x61/0xc0 trace_netfs_folio+0x39/0x40 netfs_page_mkwrite+0x14c/0x1d0 do_page_mkwrite+0x50/0x90 do_pte_missing+0x184/0x200 __handle_mm_fault+0x42d/0x500 handle_mm_fault+0x121/0x1f0 do_user_addr_fault+0x23e/0x3c0 exc_page_fault+0xc2/0xe0 asm_exc_page_fault+0x22/0x30 This is due to the invalidate_inode_pages2_range() issued at the end of the DIO write interfering with the mmap'd writes. Fixes: 102a7e2c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()") Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk Reviewed-by:
Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
Delete some xarray-based buffer wangling functions that are intended for use with bounce buffering, but aren't used because bounce-buffering got deferred to a later patch series. Now, however, the intention is to use something other than an xarray to do this. Signed-off-by:
David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
During the writeback procedure, at the end of netfs_write_folio(), pending write operations are flushed if the amount of write-streaming data stored in a page is less than the size of the folio because if we haven't modified a folio to the end, it cannot be contiguous with the following folio... except if the dirty region of the folio is right at the end of the folio space. Fix the test to take the offset into the folio into account as well, such that if the dirty region runs right up to the end of the folio, we leave the flushing for later. Fixes: 288ace2f ("netfs: New writeback implementation") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space) cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
[This was included in v2 of 9b038d00, but v1 got pushed instead] Fix netfs_unbuffered_write_iter_locked() to set the total request length in the netfs_io_request struct rather than leaving it as zero. Fixes: 288ace2f ("netfs: New writeback implementation") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <stfrench@microsoft.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Christian Brauner <brauner@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20240620173137.610345-2-dhowells@redhat.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Huang Xiaojia authored
FS_IOC_GETFSUUID ioctl exposes the uuid of a filesystem. To support the ioctl, init sb->s_uuid with super_set_uuid(). Signed-off-by:
Huang Xiaojia <huangxiaojia2@huawei.com> Reviewed-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by:
Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240624063704.2476070-1-huangxiaojia2@huawei.com Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
Sandeep Dhavale authored
Because we incorrectly reused of variable `i` in `z_erofs_gbuf_exit()` for inner loop, we may exit early from outer loop resulting in memory leak. Fix this by using separate variable for iterating through inner loop. Fixes: f36f3010 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable") Signed-off-by:
Sandeep Dhavale <dhavale@google.com> Reviewed-by:
Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by:
Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240624220206.3373197-1-dhavale@google.com Signed-off-by:
Gao Xiang <hsiangkao@linux.alibaba.com>
-
Darrick J. Wong authored
xfs_init_new_inode ignores the init_xattrs parameter for filesystems that do not have ATTR enabled. As a result, the first init_xattrs file to be created by the kernel will not have an attr fork created to store acls. Storing that first acl will add ATTR to the superblock flags, so subsequent files will be created with attr forks. The overhead of this is so small that chances are that nobody has noticed this behavior. However, this is disastrous on a filesystem with parent pointers because it requires that a new linkable file /must/ have a pre-existing attr fork, and the parent pointers code uses init_xattrs to create that fork. The preproduction version of mkfs.xfs used to set this, but the V5 sb verifier only requires ATTR2, not ATTR. There is no guard for filesystems with (PARENT && !ATTR). It turns out that I misunderstood the two flags -- ATTR means that we at some point created an attr fork to store xattrs in a file; ATTR2 apparently means only that inodes have dynamic fork offsets or that the filesystem was mounted with the "attr2" option. Fixes: 2442ee15 ("xfs: eager inode attr fork init needs attr feature awareness") Signed-off-by:
Darrick J. Wong <djwong@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Chandan Babu R <chandanbabu@kernel.org>
-
Darrick J. Wong authored
The kernel reads userspace's buffer but does not write it back. Therefore this is really an _IOW ioctl. Change this before 6.10 final releases. Signed-off-by:
Darrick J. Wong <djwong@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Chandan Babu R <chandanbabu@kernel.org>
-
Darrick J. Wong authored
For a very very long time, inode inactivation has set the inode size to zero before unmapping the extents associated with the data fork. Unfortunately, commit 3c6f46ea changed the inode verifier to prohibit zero-length symlinks and directories. If an inode happens to get logged in this state and the system crashes before freeing the inode, log recovery will also fail on the broken inode. Therefore, allow zero-size symlinks and directories as long as the link count is zero; nobody will be able to open these files by handle so there isn't any risk of data exposure. Fixes: 3c6f46ea ("xfs: sanity check directory inode di_size") Signed-off-by:
Darrick J. Wong <djwong@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Chandan Babu R <chandanbabu@kernel.org>
-
Darrick J. Wong authored
xfs/205 produces the following failure when always_cow is enabled: --- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800 +++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700 @@ -1,4 +1,5 @@ QA output created by 205 *** one file + !!! disk full (expected) *** one file, a few bytes at a time *** done This is the result of overly aggressive attempts to align cow fork delalloc reservations to the CoW extent size hint. Looking at the trace data, we're trying to append a single fsblock to the "fred" file. Trying to create a speculative post-eof reservation fails because there's not enough space. We then set @prealloc_blocks to zero and try again, but the cowextsz alignment code triggers, which expands our request for a 1-fsblock reservation into a 39-block reservation. There's not enough space for that, so the whole write fails with ENOSPC even though there's sufficient space in the filesystem to allocate the single block that we need to land the write. There are two things wrong here -- first, we shouldn't be attempting speculative preallocations beyond what was requested when we're low on space. Second, if we've already computed a posteof preallocation, we shouldn't bother trying to align that to the cowextsize hint. Fix both of these problems by adding a flag that only enables the expansion of the delalloc reservation to the cowextsize if we're doing a non-extending write, and only if we're not doing an ENOSPC retry. This requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc. I probably should have caught this six years ago when 6ca30729 was being reviewed, but oh well. Update the comments to reflect what the code does now. Fixes: 6ca30729 ("xfs: bmap code cleanup") Signed-off-by:
Darrick J. Wong <djwong@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Chandan Babu R <chandanbabu@kernel.org>
-
Christoph Hellwig authored
xfs_can_free_eofblocks returns false for files that have persistent preallocations unless the force flag is passed and there are delayed blocks. This means it won't free delalloc reservations for files with persistent preallocations unless the force flag is set, and it will also free the persistent preallocations if the force flag is set and the file happens to have delayed allocations. Both of these are bad, so do away with the force flag and always free only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC or APPEND flags set. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Darrick J. Wong <djwong@kernel.org> Signed-off-by:
Chandan Babu R <chandanbabu@kernel.org>
-
- Jun 25, 2024
-
-
Pei Li authored
When allocating too huge a snapshot table, we should fail gracefully in __snapshot_t_mut() instead of fail in kmalloc(). Reported-by:
<syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=770e99b65e26fa023ab1 Tested-by:
<syzbot+770e99b65e26fa023ab1@syzkaller.appspotmail.com> Signed-off-by:
Pei Li <peili.dev@gmail.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-