Need help tracking down a bug in the bio-FOLL_PIN patches

From: David Howells
Date: Mon Feb 06 2023 - 18:03:48 EST


Hi Jens, Christoph,

I need some help tracking down a bug in the patches that make the bio using
page pinning or no pinning using iov_iter_extract_pages(). The bug causes
seemingly random memory corruption once the "block: Convert
bio_iov_iter_get_pages to use iov_iter_extract_pages" patch is applied.

The bug was detected by a syzbot special:

https://lore.kernel.org/r/000000000000b0b3c005f3a09383@xxxxxxxxxx/

The basic test body creates/opens a file, truncates it, opens it again
O_DIRECT and then uses sendfile to copy from the file to itself, causing the
file to extend as it goes. I've added a reduced testcase below. Note that
the problem only seems to occur if several instances of the test are run in
parallel. After a few iterations, random memory corruption start showing up
and I see things like:

syz-direct-send[6095]: segfault at 0 ip 0000000000000000 sp 00007ffc81488b28 error 14 in syz-direct-sendfile[400000+1000] likely on CPU 0 (core 0, socket 0)
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_FILEPAGES val:2
BUG: Bad rss-counter state mm:00000000d5d20a39 type:MM_ANONPAGES val:2
BUG: non-zero pgtables_bytes on freeing mm: 8192

The bug goes away if the file is not truncated, O_DIRECT is not used or two
different files are used.

I've investigated the splice and iov_iter code and looked at what sendfile()
is doing in this case:

(1) sendfile creates buffer pages and adds them into a pipe, does an, in this
case, DIO read into those pages, then calls the fs write_iter to write
the data to the file.

(2) iov_iter_extract_pages() does not get refs/pins on the pages extracted
from an ITER_PIPE iterator - but it shouldn't need to as the pipe holds
the refs. These pages are passed to DIO read - this op is synchronous,
so any bios associated with it should be complete.

(3) I enabled the page_ref tracepoints and added a page flag to limit it to
pages allocated by append_pipe(). This shows the buffer pipe pages being
added and I made it dump the list of them in __bio_release_pages() (which
I made non-optional in bio_release_pages()).

(4) I added some extra page_ref_set tracepoints with weird "val" parameters
to add markers into the log.

(5) I added a tracepoint to trace the lifetime of a bio struct and a flag to
turn on the tracing, set when the pageflag added in (3) is seen. Most of
the time I can see the bio being destroyed in the correct order with
regard to the splice code, though occasionally there's a bit missing.

(6) Substituting a fixed preallocated page for the page coming out of the
pipe in iter_file_splice_write() doesn't get rid of the bug:

- array[n].bv_page = buf->page;
+ array[n].bv_page = splice_tmp;

(7) Getting an extra ref on the buffer pipe page and deliberately leaking it
gets rid of the problem.

(8) Substituting a fixed preallocated page for the page sent to the DIO read
in iov_iter_extract_pipe_pages() gets rid of the problem. The pages
going through the pipe seem to passed to write_iter with no issues.

(9) I've tried instrumenting kmap() and co. to catch debug-marked pages being
accessed after they've been released, but didn't see anything. This
might not catch if DMA is doing the corrupting.

(10) On the notion that DMA might do the corrupting, I've tried adding a
permanent ref on the pages, adding them to a list and scanning them
occasionally - but that doesn't catch anything.

(11) KASAN doesn't spot anything interesting - which might also suggest
DMA-based corruption. But since we're dealing with the contents of
pages, not the page structs themselves (I think), I'm not sure kasan
would see spot anything.

I'm wondering if the apparent interaction with sendfile/splice is actually a
red herring and that the page turnover that that induces is having an effect.

One thing I don't see is how commenting out ftruncate() should cause the
problem to go away if it's something to do with the splice buffer pipe -
though I guess ftruncate() would release a bunch of pages.

Here's an excerpt from a trace of something I'd expect to see:

page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=777
page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=666
bio: bio=00038d84 ADD-PG I=10e38c
bio: bio=00038d84 END-IO I=0
page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=623
bio: bio=00038d84 UNINIT I=0
bio_endio: bio=00038d84 iomap_dio_bio_end_io+0x0/0xec
bio: bio=00038d84 REL-PG I=0
page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=980
bio: bio=00038d84 FREE I=0
bio: bio=00038d84 UNINIT I=0
page_ref_set: pfn=0x10e38c flags=debug_mark count=1 mapcount=0 mapping=0 mt=0 val=888
page_ref_mod_and_test: pfn=0x10e38c flags=debug_mark count=0 mapcount=0 mapping=0 mt=0 val=-1 ret=1

The weird val=N codes on page_ref_set lines are:
777 - The page iov_iter_extract_pipe_pages() got from append_pipe()
666 - __bio_add_page() adding a page
623 - bio_endio() logging a page
98n - __bio_release_pages() logging the nth page
888 - iter_file_splice_write() adding page to array[]

But occasionally I'm seeing something like:

page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=777
page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=666
bio: bio=0000e514 ADD-PG I=1102df
page_ref_set: pfn=0x1102df flags=debug_mark count=1 mapcount=0 mapping=0000000000000000 mt=0 val=888
page_ref_mod_and_test: pfn=0x1102df flags=debug_mark count=0 mapcount=0 mapping=0000000000000000 mt=0 val=-1 ret=1

though I'm not sure why. Could it be an attempt to read beyond the EOF? I
don't see the bio being torn down, but the page is passed to
iter_file_splice_write() and released, despite for all I know still with
outstanding I/O pending. Another possibility is that the bio flag got
cleared.

David
---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/wait.h>

#define file_size 0x800
#define send_size 0x1dd00
#define repeat_count 1000

static char buffer[send_size];

int main(int argc, char *argv[])
{
int in, out, i, wt;

if (argc != 2 || !argv[1][0]) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
exit(2);
}

for (i = 0; i < repeat_count; i++) {
switch (fork()) {
case -1:
perror("fork");
exit(1);
case 0:
out = creat(argv[1], 0666);
if (out < 0) {
perror(argv[1]);
exit(1);
}

if (ftruncate(out, file_size) < 0) {
perror("ftruncate");
exit(1);
}

if (lseek(out, 0x200, SEEK_SET) < 0) {
perror("lseek");
exit(1);
}

in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
if (in < 0) {
perror("open");
exit(1);
}

if (sendfile(out, in, NULL, send_size) < 0) {
perror("sendfile");
exit(1);
}
exit(0);

default:
if (wait(&wt) < 0) {
perror("wait");
exit(1);
}
break;
}
}

exit(0);
}