Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)

From: James Bottomley
Date: Thu Aug 13 2009 - 15:18:29 EST

Next message: Greg KH: "Re: xterm loses data (pts regression?)"
Previous message: stephane eranian: "Re: perf_counters issue with PERF_SAMPLE_GROUP"
In reply to: Greg Freemyer: "Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)"
Next in thread: Richard Sharpe: "Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)"

On Thu, 2009-08-13 at 14:15 -0400, Greg Freemyer wrote:
> On Thu, Aug 13, 2009 at 12:33 PM, <david@xxxxxxx> wrote:
> > On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
> >
> >> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
> >>>
> >>> I am planning a complete overhaul of the discard work. Users can send
> >>> down discard requests as frequently as they like. The block layer will
> >>> cache them, and invalidate them if writes come through. Periodically,
> >>> the block layer will send down a TRIM or an UNMAP (depending on the
> >>> underlying device) and get rid of the blocks that have remained unwanted
> >>> in the interim.
> >>
> >> That is a very good idea. I've tested your original TRIM implementation on
> >> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
> >> milliseconds to digest a single TRIM command. And since your implementation
> >> sends a TRIM for each extent of each deleted file, the whole system is
> >> unusable after a short while.
> >> An optimal solution would be to consolidate the discard requests, bundle
> >> them and send them to the drive as infrequently as possible.
> >
> > or queue them up and send them when the drive is idle (you would need to
> > keep track to make sure the space isn't re-used)
> >
> > as an example, if you would consider spinning down a drive you don't hurt
> > performance by sending accumulated trim commands.
> >
> > David Lang
>
> An alternate approach is for the block layer to maintain its own bitmap of
> used/unused sectors or blocks. Unmap commands from the filesystem just
> cause the bitmap to be updated. No other effect.
>
> (Big unknown: Where will the bitmap live between reboots? Require DM
> volumes so we can have a dedicated bitmap volume in the mix to store
> the bitmap to? Maybe on mount, the filesystem has to be scanned to
> initially populate the bitmap? Other options?)

I wouldn't really have it live anywhere. Discard is best effort; it's
not required for fs integrity. As long as we don't discard an in-use
block we're free to do anything else (including forget to discard,
rediscard a discarded block etc).
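
To make that rule concrete, here is a minimal sketch in Python (not kernel
C, purely illustrative) of a discard cache that holds nothing persistent
and only guarantees that a block written since its discard is never
trimmed. DiscardCache and its method names are hypothetical; nothing like
this exists in the block layer under these names.

```python
class DiscardCache:
    """Toy model of a best-effort discard cache (hypothetical, not kernel code)."""

    def __init__(self):
        # Block numbers the filesystem has said it no longer needs.
        self.pending = set()

    def discard(self, start, count):
        # Remember the range; nothing is sent to the device yet.
        self.pending.update(range(start, start + count))

    def write(self, start, count):
        # A write puts the blocks back in use, so they must never be discarded.
        self.pending.difference_update(range(start, start + count))

    def flush(self):
        """Return merged (start, count) extents to discard, then forget them."""
        extents = []
        for block in sorted(self.pending):
            if extents and extents[-1][0] + extents[-1][1] == block:
                # Block is adjacent to the previous extent: grow it.
                extents[-1] = (extents[-1][0], extents[-1][1] + 1)
            else:
                extents.append((block, 1))
        self.pending.clear()
        return extents
```

Losing the whole set (on a crash, say) is harmless in this model: the only
cost is that some free blocks are not discarded until somebody notices
them again.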

It is theoretically possible to run all of this from user space using
the fs mappings, a bit like a defrag command.

One other option would just be to scan on mount, discard everything
empty and redo on next mount ... this might be just the thing for
laptops.
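
A rough sketch of the scan-on-mount idea, again in Python for
illustration only. It assumes the filesystem can hand over its allocation
state as a sequence of booleans (True meaning in use); free_extents is a
hypothetical helper name, and how the map is obtained is left open.

```python
def free_extents(in_use):
    """Yield (start, count) for each run of unused blocks.

    in_use is a sequence of booleans, one per block (assumed input format).
    """
    start = None
    for i, used in enumerate(in_use):
        if not used and start is None:
            # Beginning of a free run.
            start = i
        elif used and start is not None:
            # Free run ended just before block i.
            yield (start, i - start)
            start = None
    if start is not None:
        # The map ended while still inside a free run.
        yield (start, len(in_use) - start)
```

Each extent it yields would become one discard, so a mostly empty
filesystem turns into a handful of large requests rather than one per
deleted file.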

> Assuming we have a persistent bitmap in place, have a background
> scanner that kicks in when the cpu / disk is idle. It just
> continuously scans the bitmap looking for contiguous blocks of unused
> sectors. Each time it finds one, it sends the largest possible unmap
> down the block stack and eventually to the device.
>
> When normal cpu / disk activity kicks in, this process goes to sleep.
>
> That way much of the smarts are concentrated in the block layer, not
> in the filesystem code. And it is being done when the disk is
> otherwise idle, so you don't have the ncq interference.
>
> Even laptop users should have enough idle cpu available to manage
> this. Enterprise would get the large discards it wants, and
> unmentioned in the previous discussion, mdraid gets the large discards
> it also wants.
>
> i.e., if an mdraid raid5/raid6 volume is built of SSDs, it will only be
> able to discard a full stripe at a time. Otherwise the P=D1 ^ D2 logic
> is lost.
>
> Another benefit of the above is the code should be extremely safe and testable.
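
The full-stripe restriction quoted above can be sketched as a rounding
step: shrink the requested range inward to stripe boundaries and drop
whatever is left over. This is an illustrative Python sketch, assuming
block numbers are in the array's logical space and stripe_blocks is the
number of data blocks per stripe; full_stripes is a hypothetical name,
not an md function.

```python
def full_stripes(start, count, stripe_blocks):
    """Shrink [start, start + count) to whole stripes.

    Returns (start, count) of the stripe-aligned part, or None if the
    range does not cover even one complete stripe.
    """
    # Round the start up and the end down to a stripe boundary.
    first = -(-start // stripe_blocks) * stripe_blocks
    last = (start + count) // stripe_blocks * stripe_blocks
    if last <= first:
        return None
    return (first, last - first)
```

The leftover partial stripes have to stay untouched: if D1 were discarded
on its own and read back as something else, P ^ D1 would no longer give
D2.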

Actually, I think that if we go in-kernel, the discard might be better tied
into the block plugging mechanism. The real test might be: if there are no
outstanding commands and the queue is plugged, keep it plugged and begin
discarding.
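
As a rough model of that test (Python, illustrative only): issue queued
discards one at a time and stop the moment the queue stops being idle.
The is_idle callback stands in for "no outstanding commands and queue
plugged"; it, issue and drain_while_idle are all hypothetical names, not
block layer interfaces.

```python
from collections import deque


def drain_while_idle(pending: deque, is_idle, issue):
    """Issue queued discards while the queue stays idle.

    pending : deque of (start, count) extents waiting to be discarded
    is_idle : callable returning True while no real I/O is waiting
    issue   : callable that sends one extent to the device
    Returns the number of discards issued.
    """
    issued = 0
    # Re-check idleness before every discard so real I/O is never held up
    # behind more than one of them.
    while pending and is_idle():
        issue(pending.popleft())
        issued += 1
    return issued
```

Anything left in pending simply waits for the next idle period, which is
fine given that discard is best effort anyway.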

James
