Re: [RFC 0/3] hugetlbfs: optionally reserve all fs pages at mount time

From: Mike Kravetz
Date: Fri Mar 06 2015 - 13:59:17 EST


On 03/06/2015 07:10 AM, Michal Hocko wrote:
> On Mon 02-03-15 17:18:14, Mike Kravetz wrote:
>> On 03/02/2015 03:10 PM, Andrew Morton wrote:
>>> On Fri, 27 Feb 2015 14:58:08 -0800 Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:

>>>> hugetlbfs allocates huge pages from the global pool as needed. Even if
>>>> the global pool contains a sufficient number of pages for the filesystem
>>>> size at mount time, those global pages could be grabbed for some other
>>>> use. As a result, filesystem huge page allocations may fail due to a
>>>> lack of pages.
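
[ For readers unfamiliar with the failure mode, here is a minimal
user-space sketch: it maps a single huge page from a hugetlbfs mount.
The mount point /mnt/huge and the 2MB huge page size are illustrative
assumptions, not taken from the patch series. ]

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)

int main(void)
{
	int fd = open("/mnt/huge/f", O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Succeeds or fails depending on what is left in the global
	 * pool at this moment, regardless of the size the filesystem
	 * was given at mount time.
	 */
	void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		perror("mmap");		/* typically ENOMEM */
	else
		munmap(p, HPAGE_SIZE);

	close(fd);
	return 0;
}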

>>> Well OK, but why is this a sufficiently serious problem to justify
>>> kernel changes? Please provide enough info for others to be able
>>> to understand the value of the change.


>> Thanks for taking a look.
>>
>> Applications such as databases want to use huge pages for performance
>> reasons. hugetlbfs filesystem semantics with ownership and modes work
>> well to manage access to a pool of huge pages. However, the application
>> would like some reasonable assurance that allocations will not fail due
>> to a lack of huge pages. Before starting, the application will ensure
>> that enough huge pages exist on the system in the global pools. What
>> the application wants is exclusive use of a pool of huge pages.
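
[ Today the "ensure that enough huge pages exist" step is typically a
read of HugePages_Free from /proc/meminfo, as in the sketch below (the
required count is made up). The check is inherently racy: another task
can take the pages between the check and first use, which is exactly
the window described above. ]

#include <stdio.h>

#define NEEDED	256	/* huge pages the application wants */

int main(void)
{
	char line[128];
	long free_pages = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "HugePages_Free: %ld", &free_pages) == 1)
			break;
	fclose(f);

	/* Racy: the pages may be gone before they are actually used. */
	if (free_pages < NEEDED) {
		fprintf(stderr, "only %ld huge pages free\n", free_pages);
		return 1;
	}
	return 0;
}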

>> One could argue that this is a system administration issue. The global
>> huge page pools are only available to users with root privilege.
>> Therefore, exclusive use of a pool of huge pages can be obtained by
>> limiting access. However, many applications are installed to run with
>> elevated privilege to take advantage of resources like huge pages. It
>> is quite possible for one application to interfere with another,
>> especially in the case of something like huge pages where the pool
>> size is mostly fixed.

>> Suggestions for other ways to approach this situation are appreciated.
>> I saw the existing support for "reservations" within hugetlbfs and
>> thought of extending this to cover the size of the filesystem.

> Maybe I do not understand your usecase properly but wouldn't hugetlb
> cgroup (CONFIG_CGROUP_HUGETLB) help to guarantee the same? Just
> configure limits for different users/applications (inside different
> groups) so that they never overcommit the existing pool. Would that work
> for you?

Thanks for the CONFIG_CGROUP_HUGETLB suggestion; however, I do not
believe it will be a satisfactory solution for my use case. As you
point out, cgroups could be set up (by a sysadmin) for every hugetlb
user/application. In that case, the sysadmin needs knowledge of every
huge page user/application and must configure each group appropriately.
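
For illustration, the per-application setup would look something like
the sketch below. It assumes a cgroup v1 hierarchy with the hugetlb
controller mounted at /sys/fs/cgroup/hugetlb and 2MB huge pages; the
group name and limit are made up. Note that a limit only caps usage;
guaranteeing a minimum to one group still requires the sysadmin to
partition the entire pool across all groups.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/hugetlb/db";
	FILE *f;

	/* One group per application, created by the sysadmin. */
	if (mkdir(grp, 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* Cap the group at 512 x 2MB huge pages (1GB). */
	f = fopen("/sys/fs/cgroup/hugetlb/db/hugetlb.2MB.limit_in_bytes", "w");
	if (!f)
		return 1;
	fprintf(f, "%lu", 512UL * 2 * 1024 * 1024);
	fclose(f);

	/* The application's tasks are then added to the group. */
	f = fopen("/sys/fs/cgroup/hugetlb/db/tasks", "w");
	if (!f)
		return 1;
	fprintf(f, "%d", (int)getpid());
	fclose(f);
	return 0;
}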

I was approaching this from the point of view of the application. The
application wants a guarantee of a minimum number of huge pages,
independent of other users/applications. The "reserve" approach allows
the application to set aside those pages at initialization time. If it
cannot get the pages it needs, it can refuse to start, configure
itself to use less, or take other action.
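
A minimal sketch of that initialization step, relying on existing
hugetlbfs behavior (a MAP_SHARED mapping reserves its huge pages up
front, so a failed mmap() is the signal to back off); the mount point
and sizes are again illustrative:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)
#define NEEDED_PAGES	256UL

int main(void)
{
	size_t len = NEEDED_PAGES * HPAGE_SIZE;
	int fd = open("/mnt/huge/pool", O_CREAT | O_RDWR, 0600);
	void *pool;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * The mapping reserves all of the pages now; on failure the
	 * application can refuse to start or retry with less.
	 */
	pool = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (pool == MAP_FAILED) {
		fprintf(stderr, "cannot reserve %lu huge pages\n",
			NEEDED_PAGES);
		return 1;
	}

	/* ... run with a guaranteed pool of huge pages ... */

	munmap(pool, len);
	close(fd);
	return 0;
}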

As you point out, the cgroup approach could also provide guarantees to
the application if set up properly. I was aiming for an approach that
gives the application more control, independent of the sysadmin and
other users/applications.

--
Mike Kravetz