RE: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

From: Chen, Kenneth W
Date: Mon Apr 05 2004 - 12:03:14 EST


>>>> Ray Bryant wrote on Mon, April 05, 2004 8:27 AM
> Chen, Kenneth W wrote:
>
> >
> > A simple counter won't work for different file offset mapping. It has to
> > be some sort of per-inode, per-block reservation tracking. I think we are
> > steering in the right direction though.
> >
> >
>
> OK, pardon my question about test code, that is trivial enough I guess.
>
> Anyway, the only way I can see to make this work with non-zero offset is to
> hang a list of segment descriptors (offset and size) for each reserved segment
> off of the inode. Then when a new mapping comes in, we search the segment
> list to see if the new offset and size overlaps with any of the existing
> reserved segments. If it doesn't, then we make a new reservation (and request
> file system quota) for the current size, and add the current request to the
> reserved segment list. If it does, and it fits entirely in a previously
> reserved segement, then no change to reservation/quota needs to be made. If
> it only partially fits, then we need to make a new reservation/quota request
> for the number of new huge pages required and update the overlapping segment's
> length to reflect the new reservation.
>
> Then in truncate_hugepages() we can search the segment list again, discarding
> full or partial segments that occur either entirely or partially beyond
> "lstart", as appropropriate and doing hugetlb_unreserve() and
> hugetlbfs_put_quota() for the appropriate number of pages.
>
> This will be quite a bit of code and complexity. Do we still think this is
> all worth it to follow Andrew's suggestion of no API changes for "allocate on
> fault" hugetlbpages? It would be a lot cleaner just to return SIGBUS if we
> run out of hugepages and be done with it, in spite of the API change.
>
> Is there a simpler way to do the correct reservation? (One could allocate the
> pages at mmap() time, resurrecting hugetlb_prefault(), but zero the pages at
> fault time, this would solve the original problem we ran into at SGI, but
> would not solve Andi's requirement to postpone allocation so NUMA API's can
> control placement.)

I actually started coding yesterday. It doesn't look too bad (I think). I will
post it once I finished it up later today or tomorrow.

There are still some oddity in lifetime of the huge page reservation, but that
can be discussed once everyone sees the code.

- Ken


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/