BUG: btrfs: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / btrfs_preempt_reclaim_metadata_space [btrfs]

From: Mirsad Todorovac
Date: Sat Sep 23 2023 - 11:31:24 EST


Hi again,

Adding to the previous bug report, on the vanilla torvalds tree 6.6-rc2 kernel on Ubuntu 22.04 LTS,
there is an additional bug report from KCSAN (please find the config attached):

[146315.465137] ==================================================================
[146315.465747] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / btrfs_preempt_reclaim_metadata_space [btrfs]

[146315.466957] write to 0xffff888125878128 of 8 bytes by task 40555 on cpu 1:
[146315.466968] btrfs_block_rsv_release (/home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:122 /home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:293) btrfs
[146315.467591] btrfs_trans_release_metadata (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1005) btrfs
[146315.468191] __btrfs_end_transaction (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1022) btrfs
[146315.468792] btrfs_end_transaction (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1064) btrfs
[146315.469397] btrfs_evict_inode (/home/marvin/linux/kernel/torvalds2/fs/btrfs/inode.c:5274) btrfs
[146315.470000] evict (/home/marvin/linux/kernel/torvalds2/fs/inode.c:664)
[146315.470010] iput.part.0 (/home/marvin/linux/kernel/torvalds2/fs/inode.c:1803)
[146315.470019] iput (/home/marvin/linux/kernel/torvalds2/fs/inode.c:1803 (discriminator 2))
[146315.470027] do_unlinkat (/home/marvin/linux/kernel/torvalds2/fs/namei.c:4407)
[146315.470038] __x64_sys_unlink (/home/marvin/linux/kernel/torvalds2/fs/namei.c:4444)
[146315.470048] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
[146315.470060] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)

[146315.470077] read to 0xffff888125878128 of 8 bytes by task 259386 on cpu 18:
[146315.470088] btrfs_preempt_reclaim_metadata_space (/home/marvin/linux/kernel/torvalds2/fs/btrfs/space-info.c:1167) btrfs
[146315.470698] process_one_work (/home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2630)
[146315.470710] worker_thread (/home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2697 /home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2784)
[146315.470720] kthread (/home/marvin/linux/kernel/torvalds2/kernel/kthread.c:388)
[146315.470729] ret_from_fork (/home/marvin/linux/kernel/torvalds2/arch/x86/kernel/process.c:147)
[146315.470741] ret_from_fork_asm (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:312)

[146315.470756] value changed: 0x0000000000080000 -> 0x0000000000000000

[146315.470770] Reported by Kernel Concurrency Sanitizer on:
[146315.470778] CPU: 18 PID: 259386 Comm: kworker/u65:12 Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
[146315.470791] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[146315.470798] Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space [btrfs]
[146315.471413] ==================================================================

<SMALLTALK>
Note that the first report with the btrfs_calculate_inode_block_rsv_size [btrfs] / btrfs_use_block_rsv [btrfs]
"EXPERIMENTAL PATCH" thread addressed the issue of the read section reading various parts of the block_rsv
which was not done atomically, so I think there could be a real data-race if a part of the structure is changed
by another thread while reading the structure members and making the execution path decisions based on their value.

I hate locks, but this seems like a place for a read lock to ensure the integrity of data while processing
the structure members. Unless locking was taken somewhere previously before calling all those functions
in the stacktrace ... But I don't have that deep insight into the btrfs module (150 K lines).

Of course, I did not run the kernel build w/o your review, even when the logic seems OK.

(But the patch was fixing the symptom and not the cause, so I realise that adding such would make the module
bloated and unstable ... However, somebody wise said that "more eyes see more" ...).
</SMALLTALK>

However, encouraged by your care for btrfs file system integrity I have decided to submit the
3rd bug report.

This time, and for your convenience, I took the liberty of just reviewing the KCSAN-suspected lines:

The write side:

fs/btrfs/block-rsv.c
====================
105 static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
106 struct btrfs_block_rsv *block_rsv,
107 struct btrfs_block_rsv *dest, u64 num_bytes,
108 u64 *qgroup_to_release_ret)
109 {
110 struct btrfs_space_info *space_info = block_rsv->space_info;
111 u64 qgroup_to_release = 0;
112 u64 ret;
113
114 spin_lock(&block_rsv->lock);
115 if (num_bytes == (u64)-1) {
116 num_bytes = block_rsv->size;
117 qgroup_to_release = block_rsv->qgroup_rsv_size;
118 }
119 block_rsv->size -= num_bytes;
120 if (block_rsv->reserved >= block_rsv->size) {
121 num_bytes = block_rsv->reserved - block_rsv->size;
→ 122 block_rsv->reserved = block_rsv->size;
123 block_rsv->full = true;
124 } else {
125 num_bytes = 0;
126 }
127 if (qgroup_to_release_ret &&
128 block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
129 qgroup_to_release = block_rsv->qgroup_rsv_reserved -
130 block_rsv->qgroup_rsv_size;
131 block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
132 } else {
133 qgroup_to_release = 0;
134 }
135 spin_unlock(&block_rsv->lock);
136
137 ret = num_bytes;
138 if (num_bytes > 0) {
139 if (dest) {
140 spin_lock(&dest->lock);
141 if (!dest->full) {
142 u64 bytes_to_add;
143
144 bytes_to_add = dest->size - dest->reserved;
145 bytes_to_add = min(num_bytes, bytes_to_add);
146 dest->reserved += bytes_to_add;
147 if (dest->reserved >= dest->size)
148 dest->full = true;
149 num_bytes -= bytes_to_add;
150 }
151 spin_unlock(&dest->lock);
152 }
153 if (num_bytes)
154 btrfs_space_info_free_bytes_may_use(fs_info,
155 space_info,
156 num_bytes);
157 }
158 if (qgroup_to_release_ret)
159 *qgroup_to_release_ret = qgroup_to_release;
160 return ret;
161 }

The read side:

fs/btrfs/space-info.c
=====================
1133 static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
1134 {
1135 struct btrfs_fs_info *fs_info;
1136 struct btrfs_space_info *space_info;
1137 struct btrfs_block_rsv *delayed_block_rsv;
1138 struct btrfs_block_rsv *delayed_refs_rsv;
1139 struct btrfs_block_rsv *global_rsv;
1140 struct btrfs_block_rsv *trans_rsv;
1141 int loops = 0;
1142
1143 fs_info = container_of(work, struct btrfs_fs_info,
1144 preempt_reclaim_work);
1145 space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
1146 delayed_block_rsv = &fs_info->delayed_block_rsv;
1147 delayed_refs_rsv = &fs_info->delayed_refs_rsv;
1148 global_rsv = &fs_info->global_block_rsv;
1149 trans_rsv = &fs_info->trans_block_rsv;
1150
1151 spin_lock(&space_info->lock);
1152 while (need_preemptive_reclaim(fs_info, space_info)) {
1153 enum btrfs_flush_state flush;
1154 u64 delalloc_size = 0;
1155 u64 to_reclaim, block_rsv_size;
1156 u64 global_rsv_size = global_rsv->reserved;
1157
1158 loops++;
1159
1160 /*
1161 * We don't have a precise counter for the metadata being
1162 * reserved for delalloc, so we'll approximate it by subtracting
1163 * out the block rsv's space from the bytes_may_use. If that
1164 * amount is higher than the individual reserves, then we can
1165 * assume it's tied up in delalloc reservations.
1166 */
→ 1167 block_rsv_size = global_rsv_size +
→ 1168 delayed_block_rsv->reserved +
→ 1169 delayed_refs_rsv->reserved +
→ 1170 trans_rsv->reserved;
1171 if (block_rsv_size < space_info->bytes_may_use)
1172 delalloc_size = space_info->bytes_may_use - block_rsv_size;
1173
1174 /*
1175 * We don't want to include the global_rsv in our calculation,
1176 * because that's space we can't touch. Subtract it from the
1177 * block_rsv_size for the next checks.
1178 */
1179 block_rsv_size -= global_rsv_size;
1180
1181 /*
1182 * We really want to avoid flushing delalloc too much, as it
1183 * could result in poor allocation patterns, so only flush it if
1184 * it's larger than the rest of the pools combined.
1185 */
1186 if (delalloc_size > block_rsv_size) {
1187 to_reclaim = delalloc_size;
1188 flush = FLUSH_DELALLOC;
1189 } else if (space_info->bytes_pinned >
1190 (delayed_block_rsv->reserved +
1191 delayed_refs_rsv->reserved)) {
1192 to_reclaim = space_info->bytes_pinned;
1193 flush = COMMIT_TRANS;
1194 } else if (delayed_block_rsv->reserved >
1195 delayed_refs_rsv->reserved) {
1196 to_reclaim = delayed_block_rsv->reserved;
1197 flush = FLUSH_DELAYED_ITEMS_NR;
1198 } else {
1199 to_reclaim = delayed_refs_rsv->reserved;
1200 flush = FLUSH_DELAYED_REFS_NR;
1201 }
1202
1203 spin_unlock(&space_info->lock);
1204
1205 /*
1206 * We don't want to reclaim everything, just a portion, so scale
1207 * down the to_reclaim by 1/4. If it takes us down to 0,
1208 * reclaim 1 items worth.
1209 */
1210 to_reclaim >>= 2;
1211 if (!to_reclaim)
1212 to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
1213 flush_space(fs_info, space_info, to_reclaim, flush, true);
1214 cond_resched();
1215 spin_lock(&space_info->lock);
1216 }
1217
1218 /* We only went through once, back off our clamping. */
1219 if (loops == 1 && !space_info->reclaim_size)
1220 space_info->clamp = max(1, space_info->clamp - 1);
1221 trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
1222 spin_unlock(&space_info->lock);
1223 }

It beats me ... if all the locks were acquired, where did data-race occur?

btrfs_preempt_reclaim_metadata_space() was executed from the work queue, so it is obviously asynchronous ...

Somewhat innocuous, in btrfs_preempt_reclaim_metadata_space() called from the work queue, space_info->lock
is held, while in the path of:

write to 0xffff888125878128 of 8 bytes by task 40555 on cpu 1:
btrfs_block_rsv_release (fs/btrfs/block-rsv.c:122 /home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:293) btrfs
btrfs_trans_release_metadata (fs/btrfs/transaction.c:1005) btrfs
__btrfs_end_transaction (fs/btrfs/transaction.c:1022) btrfs
btrfs_end_transaction (fs/btrfs/transaction.c:1064) btrfs
btrfs_evict_inode (fs/btrfs/inode.c:5274) btrfs
evict (fs/inode.c:664)

This lock isn't being acquired:

$ grep 'space_info->lock' -r fs/btrfs/block-rsv.c fs/btrfs/transaction.c fs/btrfs/inode.c fs/inode.c
$

fs/btrfs/block-rsv.c: block_rsv_release_bytes() acquired block_rsv->lock, but not space_info->lock.

Now Vulcan logic says that there could be concurrency there while processing:

→ 1167 block_rsv_size = global_rsv_size +
→ 1168 delayed_block_rsv->reserved +
→ 1169 delayed_refs_rsv->reserved +
→ 1170 trans_rsv->reserved;

but I could also be missing something obvious.

I find the btrfs code so challenging, and especially I am anticipating the RCU part.

Forgive me this rant, but filesystems are so interesting and such a great exercise for my little grey cells ;-)

Best regards,
Mirsad Todorovac

Attachment: config-6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0.xz
Description: application/xz