Re: [PATCH v2][RESEND] seq_file: don't set read position for invalid iterator

From: Tomasz Majchrzak
Date: Thu Nov 24 2016 - 10:23:52 EST


On Wed, Nov 02, 2016 at 10:11:47AM +0100, Miklos Szeredi wrote:
> On Mon, Oct 31, 2016 at 10:32:21AM +0100, Tomasz Majchrzak wrote:
> > On Wed, Oct 26, 2016 at 11:17:13AM +0200, Miklos Szeredi wrote:
> > > On Wed, Oct 12, 2016 at 2:07 PM, Tomasz Majchrzak
> > > <tomasz.majchrzak@xxxxxxxxx> wrote:
> > > > If kernfs file is empty on a first read, successive read operations
> > > > using the same file descriptor will return no data, even when data is
> > > > available. Default kernfs 'seq_next' implementation advances iterator
> > > > position even when next object is not there. Kernfs 'seq_start' for
> > > > following requests will not return iterator as position is already on
> > > > the second object.
> > > >
> > > > This bug doesn't allow to monitor badblocks sysfs files from MD raid.
> > > > They are initially empty but if data appears at some stage, userspace is
> > > > not able to read it. It doesn't affect any released applications but it
> > > > is necessary for upcoming bad block support for external metadata in MD
> > > > raid.
> > >
> > > What is the expectation from the userspace application? Should
> > > seq_file support "tail -f" as more data is added to the file? AFAICS
> > > this patch doesn't address that generally, just the empty->nonempty
> > > transition (with the single record case).
> >
> > Yes, it fixes only empty-nonempty transition. I believe it already works
> > fine ("tail -f") if the file cursor is in the middle of the file,
>
> "tail -f" works if there are multiple records. It doesn't work for the "single"
> style of seq_file used by sysfs.
>
> > however
> > mdmon never performs such operation - it "acknowledges" entry read from the
> > start of one sysfs file (unacknowledged_bad_blocks) via other sysfs file
> > (bad_blocks) and reads the sysfs file again from the start (there is new
> > data at position 0 at this stage).
> >
> > My patch just resolves a regression introduced by a fix for multiple record
> > file: 4cdfe84b514 ("deal with the first call of ->show() generating no
> > output").
>
> The above commit (from 2.6.27) didn't change the beavior relative to sysfs
> files, AFAICS. Before that commit the index was incremented if the first
> iterator was empty, just like after the patch.

After the patch m->index is updated after 'next' iterator, regardless if valid
or invalid iterator has been returned. Prior to the patch there was no call to
'next' operator at all. The similar code few lines below ignores position
for invalid iterator.

> > It makes it consistent with rest of the code in seq_file
> > (seq_read, traverse) - new iterator position is ignored if valid iterator
> > has not been returned.
>
> Your patch doesn't seem to make the behavior consistent, it just special cases
> the last record being empty. What if the last two (three, etc) records are
> empty? It would be consistent if m->index always pointed to the first record
> matching the given file position, for example.

I don't understand it. Is it possible to map file position to the index
(record)? I think there is no way to determine record size without actually
reading it.

> But I doubt that we actually need to do that. For example just special casing
> the zero offset (always translating to zero index) would be conceptually simpler
> and equivalent to your patch for the sysfs case.

Do you mean such piece of code?

if (ppos == 0)
m->index = 0;

> But see below. I still don't see what you gain by not doing the open/read/close
> when polling the badblocks file.
>
> > > Why does userspace app not do open+read+close each time it polls the
> > > badblocks file?
> >
> > This excerpt from mdadm/mdmon-design.txt file explains it:
> >
> > "
> > The 'monitor' has the primary role of monitoring the array for
> > important state changes and updating the metadata accordingly. As
> > writes to the array can be blocked until 'monitor' completes and
> > acknowledges the update, it much be very careful not to block itself.
> > In particular it must not block waiting for any write to complete else
> > it could deadlock. This means that it must not allocate memory as
> > doing this can require dirty memory to be written out and if the
> > system choose to write to the array that mdmon is monitoring, the
> > memory allocation could deadlock.
> >
> > So 'monitor' must never allocate memory and must limit the number of
> > other system call it performs. It may:
> > - use select (or poll) to wait for activity on a file descriptor
> > - read from a sysfs file descriptor
> > - write to a sysfs file descriptor
> > - write the metadata out to the block devices using O_DIRECT
> > - send a signal (kill) to the manager thread
> >
> > It must not e.g. open files or do anything similar that might allocate
> > resources.
> > "
>
> seq_read() *will* allocate memory; see seq_buf_alloc().

mdmon (RAID monitoring application) opens and reads sysfs file during
initialization stage. At this stage RAID array is not enabled yet so memory
allocations are allowed. The first read doesn't process the data obtained,
it is just performed in order to trigger memory allocation. The file is not
being closed later in order to keep the buffer. The sysfs files used in such
manner are single-record and their length do not exceed page size so no
additional memory allocations are performed in next read operations.

Tomek