Re: [RFC PATCH] 9p: forbid use of mempool for TFLUSH

From: Kent Overstreet
Date: Wed Jul 13 2022 - 03:40:20 EST


On 7/13/22 03:12, Dominique Martinet wrote:
> Kent Overstreet wrote on Wed, Jul 13, 2022 at 02:39:06AM -0400:
>> On 7/13/22 00:17, Dominique Martinet wrote:
>>> TFLUSH is called while the thread still holds memory for the
>>> request we're trying to flush, so mempool alloc can deadlock
>>> there. With the p9_msg_buf_size() rework the flush allocation is
>>> small, so just make it fail if allocation failed; all that does
>>> is potentially leak the request we're flushing until its reply
>>> finally does come... or, if it never does, until umount.
>>
>> Why not just add separate mempools for flushes? We don't have to
>> allocate memory for big payloads so they won't cost much, and then
>> the IO paths will be fully mempool-ified :)
>
> I don't think it really matters either way -- I'm much more worried
> about the two points I gave in the commit comment section: mempools
> not being shared leads to increased memory usage when there are many
> mostly-idle mounts (I know users who need that), and, more
> importantly, the mempool wait being uninterruptible/non-failable
> might be "nice" from the mempool user's side, but I'd really prefer
> users to be able to ^C out of a mount to a bad server that's stuck
> in mempool_alloc, at least.

We should never get stuck allocating memory - if that happens, game over, system can no longer make forward progress.

(oh, that does give me an idea: Suren just implemented a code tagging mechanism for tracking memory allocations by callsite, and I was talking about using it for tracking latency. Memory allocation latency would be a great thing to measure - it's something we care about, and we haven't had a good way of measuring it before.)
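
To make the separate-flush-mempool suggestion concrete, here's roughly what I have in mind - an untested sketch, where the pool name, the size constant and the init function are made up for illustration; only the mempool calls are real API:

#include <linux/mempool.h>
#include <linux/slab.h>

/* TFLUSH/RFLUSH messages are tiny and fixed-size */
#define P9_FLUSH_BUFSZ	64

static mempool_t *p9_flush_pool;

static int p9_flush_pool_init(void)
{
	/*
	 * min_nr = 4: the mempool keeps four preallocated buffers in
	 * reserve, so a flush can always be sent even when kmalloc()
	 * is failing under memory pressure.
	 */
	p9_flush_pool = mempool_create_kmalloc_pool(4, P9_FLUSH_BUFSZ);
	return p9_flush_pool ? 0 : -ENOMEM;
}

Since flushes never carry big payloads, the reserve costs a few hundred bytes per pool - basically free compared to the request buffers.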

> It looked good before I realized all the ways this could hang, but
> now I just think for something like 9p it makes more sense to fail
> the allocation and the IO than to bounce forever trying to allocate
> memory we don't have.

A filesystem that randomly fails IOs is, fundamentally, not a filesystem that _works_. This whole thing started because 9pfs failing IOs has been breaking my xfstests runs - and 9pfs isn't the thing I'm trying to test!

Local filesystems and the local IO stack have always had this understanding: IO needs to _just work_ and be able to make forward progress without allocating additional memory, because otherwise everything falls over - memory reclaim requires doing IO. It's fundamentally no different with network filesystems; the cultural expectation just hasn't been there historically, and not for any good technical reason. In -net land dropping packets is generally a fine thing to do when you have to, but it's really not in filesystem land - not if you want to make something that's reliable under memory pressure!
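
And to be clear about the semantics we're arguing over: with mempools, the gfp mask picks the behaviour. A sketch - the wrapper name is made up, but the mempool_alloc() behaviour is as documented:

#include <linux/mempool.h>
#include <linux/gfp.h>

static void *p9_flush_buf_get(mempool_t *pool, bool can_fail)
{
	/*
	 * GFP_NOFS allows direct reclaim, so mempool_alloc() never
	 * returns NULL: when the pool is empty it waits for an element
	 * to be freed back to the reserve - the forward progress
	 * guarantee the IO path wants.
	 *
	 * GFP_NOWAIT does not allow reclaim, so mempool_alloc() returns
	 * NULL instead of sleeping - the "fail the flush" behaviour,
	 * with no uninterruptible wait.
	 */
	return mempool_alloc(pool, can_fail ? GFP_NOWAIT : GFP_NOFS);
}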

> Let's first see if you still see high order allocation failures when
> these are made much less likely with Christian's patch.

Which patch is that? Unless you're talking about my mempool patch?

> What I intend to push this cycle is in
> https://github.com/martinetd/linux/commits/9p-test
> up to 'net/9p: allocate appropriate reduced message buffers'; if you
> can easily reproduce the failures I'd appreciate it if you could
> confirm whether this helps.
>
> (just waiting for Christian's confirmation + adjusting the strcmp for
> rdma before I push it to 9p-next)
> --
> Dominique