Re: [PATCH v6] hugetlb: Add hugetlb.*.numa_stat file

From: Marco Elver
Date: Tue Nov 16 2021 - 16:48:01 EST


On Tue, Nov 16, 2021 at 12:59PM -0800, Shakeel Butt wrote:
> On Tue, Nov 16, 2021 at 12:48 PM Mina Almasry <almasrymina@xxxxxxxxxx> wrote:
[...]
> > > Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> > > and at least relieve you to not worry about it (and shift the burden to
> > > WRITE_ONCE's implementation).
> > >
> >
> > Thank you very much for the detailed response. I can add READ_ONCE()
> > at the no-lock read site, that is no issue.
> >
> > However, for the writes that happen while holding the lock, the write
> > is like so:
> > + h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
> >
> > And like so:
> > + h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
> >
> > I.e. they are increments/decrements. Sorry if I missed it but I can't
> > find an INC_ONCE(), and it seems wrong to me to do something like:
> >
> > + WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
> > +
> > h_cg->nodeinfo[page_to_nid(page)] + nr_pages);

>From what I gather there are no concurrent writers, right?

WRITE_ONCE(a, a + X) is perfectly fine. What it says is that you can
have concurrent readers here, but no concurrent writers (and KCSAN will
still check that). Maybe we need a more convenient macro for this idiom
one day..

Though I think for something like

h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;

it seems there wants to be an temporary long* so that you could write
WRITE_ONCE(*usage, *usage + nr_pages) or something.

> > I know we're holding a lock anyway so there is no race, but to the
> > casual reader this looks wrong as there is a race between the fetch of
> > the value and the WRITE_ONCE(). What to do here? Seems to me the most
> > reasonable thing to do is just READ_ONCE() and leave the write plain?
> >
> >
>
> How about atomic_long_t?

That would work of course; if this is very hot path code it might be
excessive if you don't have concurrent writers.

Looking at the patch in more detail, the counter is a stat counter that
can be read from a stat file, correct? Hypothetically, what would happen
if the reader of 'usage' reads approximate values?

If the answer is "nothing", then this could classify as an entirely
"benign" data race and you could only use data_race() on the reader and
leave the writers unmarked using normal +=/-=. Check if it fits
"Data-Racy Reads for Approximate Diagnostics" [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt#n74