Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read

From: Baokun Li
Date: Mon Dec 11 2023 - 21:16:11 EST


On 2023/12/12 1:49, Jan Kara wrote:
On Thu 07-12-23 22:15:55, Baokun Li wrote:
On 2023/12/7 3:37, Jan Kara wrote:
On Tue 05-12-23 20:50:30, Baokun Li wrote:
On 2023/12/4 22:41, Jan Kara wrote:
On Mon 04-12-23 21:50:18, Baokun Li wrote:
On 2023/12/4 20:11, Jan Kara wrote:
The problem occurs with a one-master, two-slave MySQL setup on three
physical machines; with sysbench stress tests running on each of the
three machines, the problem shows up roughly once every two to three hours.

The problem is with the relay log file: when it occurs, a stretch of a few
dozen bytes in the middle of the file is read back as all zeroes, while the
data on disk is intact. This is a journal-like file: a writer process
receives data from the master node and writes it locally, and a separate
replay process reads the file and replays the operations it describes
(SQL statements). The problem is that, when replaying, it finds the data
it has read is corrupted and is not valid SQL, even though the data on
disk is normal.

It's not confirmed that a buffered read racing with a direct IO write is
actually what causes this issue, but it is the only scenario we can
reproduce with our simplified local scripts. Also, with patch 1 applied,
the MySQL stress-test scenario has now been running for 5 days without
reproducing the problem.

I'll double-check the problem scenario, although buffered reads racing
with buffered writes don't seem to have this problem.
Yeah, from what you write it seems that the replay code is using buffered
reads on the journal file. I guess you could confirm that with a bit of
kernel tracing, but the symptoms look pretty convincing. Did you try talking
to the MySQL guys about why they are doing this?
The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:
```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

// Fire on the buffered read path, but only for files whose name starts
// with "relay" (strncmp() == 0 means the prefix matches).
kprobe:generic_file_buffered_read
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
    printf("read path: %s\n",
           str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}

// Same filter for the ext4 buffered write path.
kprobe:ext4_buffered_write_iter
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
    printf("write path: %s\n",
           str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```
I suspect there are DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.
Interesting. Not sure how your partially zeroed-out buffers could happen
with fully buffered IO.

After looking at the code again and again, the following race seems to
bypass the memory barrier:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)
    // write isize 2048
    unlock_page(page)

ext4_file_read_iter
 generic_file_read_iter
  filemap_read
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags(folio, 0));
     if (ret)
      smp_rmb();
      // The read barrier here ensures
      // that data 0-2048 in the page is synchronized.
                           ext4_buffered_write_iter
                            generic_perform_write
                             copy_page_from_iter_atomic
                             ext4_da_write_end
                              ext4_da_do_write_end
                               block_write_end
                                __block_commit_write
                                 folio_mark_uptodate
                                  smp_wmb()
                                   set_bit(PG_uptodate, folio_flags(folio, 0))
                               i_size_write(inode, pos + copied)
                               // write isize 4096
                               unlock_page(page)
   // read isize 4096
   isize = i_size_read(inode)
   // But there is no read barrier here,
   // so the data in the 2048-4096 range
   // may not be synchronized yet !!!
   copy_page_to_iter()
   // copyout 4096

In the race above, we read the updated i_size, but there is no read
barrier to ensure that the page data covered by that i_size is visible
at this point. Therefore, we may copy out page contents that have not
yet become visible on this CPU. Is it expected that we can read
zero-filled data in this case?
Indeed, I have checked and filemap_read() (this dates back even to the old
do_generic_file_read() code) does copy data only after checking the uptodate
flag and then sampling i_size, so we may be copying state from the middle of
the racing write, and there is indeed nothing that would prevent prefetching
the page data before fetching the inode size. I agree this is kind of nasty,
so I think adding a read barrier between i_size_read() and
copy_page_to_iter() makes sense. Does it fix your issue with MySQL?

Honza
Thank you very much for confirming this issue!

Adding a read barrier between i_size_read() and copy_page_to_iter()
does seem to solve the problem. The environment, which previously
reproduced the problem once every 2-3 hours, has now been running for
more than 60 hours without reproducing it since the read barrier was added.

This is a load-load reordering issue, which occurs only on architectures
with weak memory ordering (e.g. ARM64, Alpha), not on strongly ordered
architectures (e.g. x86). The issue was in fact discovered while
migrating MySQL services from x86 to ARM64.
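
To make the required ordering concrete, the race reduces to the classic
message-passing pattern. Below is a small userspace analogue of my own
(using C11 atomics rather than the kernel's barriers, so it is only an
illustration, not kernel code); the writer/reader/isize names are mine,
and the shared isize variable plays the role of i_size:
```
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

static char data[4096];		/* plays the role of the page contents */
static atomic_long isize;	/* plays the role of i_size */

static void *writer(void *arg)
{
	(void)arg;

	memset(data, 'a', 2048);	/* first buffered write */
	/* release orders the data stores before the size store (smp_wmb()) */
	atomic_store_explicit(&isize, 2048, memory_order_release);

	memset(data + 2048, 'b', 2048);	/* second buffered write */
	atomic_store_explicit(&isize, 4096, memory_order_release);
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;

	/* acquire orders the size load before the data loads (smp_rmb()) */
	long size = atomic_load_explicit(&isize, memory_order_acquire);

	/*
	 * With the acquire, every byte below the observed size is
	 * guaranteed to be visible, so this never triggers.  If the load
	 * were relaxed instead, a weakly ordered CPU (ARM64, Alpha) could
	 * reorder the data loads before the size load and read stale zero
	 * bytes even though it saw size == 4096 -- exactly the zero-filled
	 * ranges seen in the relay log.
	 */
	for (long i = 0; i < size; i++) {
		if (data[i] == 0)
			printf("stale byte at %ld (size %ld)\n", i, size);
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}
```
The acquire on the size load is the analogue of the proposed smp_rmb():
drop it, and nothing stops a weakly ordered CPU from reading the page
data before it reads i_size, even though the earlier uptodate check
already had its own barrier.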

And in theory the problem is not limited to ext4: any filesystem that
calls filemap_read() without holding the inode lock (e.g. btrfs, f2fs,
ubifs, ...) can hit it, while filesystems that take the inode lock
(e.g. xfs, nfs) won't have this problem.

Later I will send the patch that adds the memory barrier.
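
Roughly, what I have in mind is the placement below (only a sketch
against filemap_read() in mm/filemap.c, not the final patch; the
surrounding lines are paraphrased from the current code and may not
match it exactly):
```
	/* filemap_read(): the folio batch is already known to be uptodate. */
	isize = i_size_read(inode);
	if (unlikely(iocb->ki_pos >= isize))
		goto put_folios;
	end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);

	/*
	 * Pairs with the smp_wmb() on the write side shown above: the page
	 * contents are written before i_size is updated, so once we observe
	 * the new i_size we must also observe the data it covers before
	 * copying anything out.
	 */
	smp_rmb();

	/* ... existing loop copying the data out via copy_page_to_iter() ... */
```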

Thanks!
--
With Best Regards,
Baokun Li