Re: [dm-devel] dm: Make MIN_IOS, et al, tunable via sysctl.

From: Mikulas Patocka
Date: Tue Aug 20 2013 - 17:41:46 EST




On Mon, 19 Aug 2013, Mike Snitzer wrote:

> On Fri, Aug 16 2013 at 6:55pm -0400,
> Frank Mayhar <fmayhar@xxxxxxxxxx> wrote:
>
> > The device mapper and some of its modules allocate memory pools at
> > various points when setting up a device. In some cases, these pools are
> > fairly large, for example the multipath module allocates a 256-entry
> > pool and the dm itself allocates three of that size. In a
> > memory-constrained environment where we're creating a lot of these
> > devices, the memory use can quickly become significant. Unfortunately,
> > there's currently no way to change the size of the pools other than by
> > changing a constant and rebuilding the kernel.
> >
> > This patch fixes that by changing the hardcoded MIN_IOS (and certain
> > other) #defines in dm-crypt, dm-io, dm-mpath, dm-snap and dm itself to
> > sysctl-modifiable values. This lets us change the size of these pools
> > on the fly, so we can reduce the size of the pools and reduce memory
> > pressure.
>
> These memory reserves are a long-standing issue with DM (made worse when
> request-based mpath was introduced). Two years ago, I assembled a patch
> series that took one approach to trying to fix it:
> http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/series.html
>
> But in the end I wasn't convinced sharing the memory reserve would allow
> for 100s of mpath devices to make forward progress if memory is
> depleted.
>
> All said, I think adding the ability to control the size of the memory
> reserves is reasonable. It allows for informed admins to establish
> lower reserves (based on the awareness that rq-based mpath doesn't need
> to support really large IOs, etc) without compromising the ability to
> make forward progress.
>
> But, as mentioned in my previous mail, I'd like to see this implemented
> in terms of module_param_named().
>
> > We tested performance of dm-mpath with smaller MIN_IOS sizes for both dm
> > and dm-mpath, from a value of 32 all the way down to zero.
>
> Bio-based can safely be reduced, as this older (uncommitted) patch did:
> http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/0000-dm-lower-bio-based-reservation.patch
>
> > Bearing in mind that the underlying devices were network-based, we saw
> > essentially no performance degradation; if there was any, it was down
> > in the noise. One might wonder why these sizes are the way they are;
> > I investigated and they've been unchanged since at least 2006.
>
> Performance isn't the concern. The concern is: does DM allow for
> forward progress if the system's memory is completely exhausted?

There is one possible deadlock that was introduced in commit
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22-rc1. Unfortunately, no
one found that bug at that time, and now it seems hard to revert that
commit.

The problem is this:

* we send bio1 to the device dm-1, device mapper splits it into bio2 and
bio3 and sends both of them to the device dm-2. These two bios are added
to current->bio_list.

* bio2 is popped off current->bio_list, a mempool entry from device dm-2's
mempool is allocated, bio4 is created and sent to the device dm-3. bio4 is
added to the end of current->bio_list.

* bio3 is popped off current->bio_list, and an allocation from device
dm-2's mempool is attempted. Suppose that the mempool is exhausted, so we
wait until some existing work (bio2) finishes and returns its entry to the
mempool.

So: bio3's request routine waits until bio2 finishes and refills the
mempool. bio2 is waiting for bio4 to finish. bio4 is in current->bio_list
and is waiting until bio3's request routine finishes. Deadlock.
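
To make the cycle concrete, here is a toy userspace model of the three
steps above. It is only a sketch: toy_bio, toy_state, list_add_tail and
run_scenario are made-up names, not kernel interfaces, and the model
assumes nothing else in the system is in flight that could return an
entry to dm-2's mempool. A dm-3 bio is treated as completing as soon as
it is popped, which gives its parent's entry back.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
enum { MAX_BIOS = 8, DM2 = 2, DM3 = 3 };

struct toy_bio {
	int id;     /* 2, 3, 4, ... as in the scenario above */
	int target; /* DM2 or DM3 */
};

struct toy_state {
	struct toy_bio list[MAX_BIOS]; /* models current->bio_list (FIFO) */
	size_t head, tail;
	int dm2_pool_free;             /* free entries in dm-2's mempool */
	int next_id;
};

static void list_add_tail(struct toy_state *s, int id, int target)
{
	assert(s->tail < MAX_BIOS);
	s->list[s->tail++] = (struct toy_bio){ .id = id, .target = target };
}

/*
 * Returns the id of the bio whose request routine would block forever,
 * or 0 if every queued bio gets processed.
 */
int run_scenario(int dm2_pool_size)
{
	struct toy_state s = { .dm2_pool_free = dm2_pool_size, .next_id = 4 };

	/* dm-1 splits bio1 into bio2 and bio3, both sent to dm-2. */
	list_add_tail(&s, 2, DM2);
	list_add_tail(&s, 3, DM2);

	while (s.head < s.tail) {
		struct toy_bio b = s.list[s.head++];

		if (b.target == DM2) {
			if (s.dm2_pool_free == 0) {
				/*
				 * The real allocation would sleep here. The entry
				 * only comes back when a child bio completes, but
				 * the child is still in the list and is not popped
				 * until this routine returns.
				 */
				return b.id;
			}
			s.dm2_pool_free--;
			/* dm-2 creates a child bio for dm-3, queued at the tail. */
			list_add_tail(&s, s.next_id++, DM3);
		} else {
			/* Child completes; its parent's entry is returned. */
			s.dm2_pool_free++;
		}
	}
	return 0;
}
```

With one free entry, run_scenario(1) reports bio3 as the stuck bio; with
two entries both bios get through, which is the whole argument for the
reserves being sized for the worst case.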

In practice, it is not so serious because in mempool_alloc there is:
	/*
	 * FIXME: this should be io_schedule().  The timeout is there as a
	 * workaround for some DM problems in 2.6.18.
	 */
	io_schedule_timeout(5*HZ);

So it waits for 5 seconds and retries. If there is something in the
system that is able to free memory in the meantime, the allocation
succeeds and processing resumes.
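
The effect of the timeout can be sketched with another toy userspace
function. Again, this is an illustration under assumptions, not the
mempool code: toy_mempool_alloc and its parameters are hypothetical, one
"tick" stands for one 5-second timeout, and memory_free_at_tick stands
for the moment something else in the system frees memory.

```c
#include <assert.h>

/*
 * Toy model of "wait with a timeout, then retry"; not the kernel's
 * mempool_alloc. Returns the tick at which the allocation succeeds,
 * or -1 if it still has not succeeded after max_ticks timeouts.
 */
int toy_mempool_alloc(int reserve_free, int memory_free_at_tick,
		      int max_ticks)
{
	assert(max_ticks >= 0);

	for (int tick = 0; tick <= max_ticks; tick++) {
		/* First try the ordinary allocator. */
		if (tick >= memory_free_at_tick)
			return tick;

		/* Then fall back to the preallocated reserve. */
		if (reserve_free > 0)
			return tick;

		/*
		 * Both failed: sleep for one timeout period and retry,
		 * instead of sleeping until an entry is returned.
		 */
	}
	return -1;
}
```

For example, toy_mempool_alloc(0, 2, 10) succeeds on tick 2, i.e. after
two timeouts, even though no entry was ever returned to the reserve. If
nothing frees memory, the loop never succeeds and the deadlock is still
there.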

> This is why request-based has such an extensive reserve, because it
> needs to account for cloning the largest possible request that comes in
> (with multiple bios).

Mikulas