Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

From: John Hubbard
Date: Wed Nov 29 2017 - 00:14:31 EST


On 11/28/2017 12:12 AM, Michal Hocko wrote:
> On Mon 27-11-17 15:26:27, John Hubbard wrote:
> [...]
>> Let me add a belated report, then: we ran into this limit while implementing
>> an early version of Unified Memory[1], back in 2013. The implementation
>> at the time depended on tracking that assumed "one allocation == one vma".
>
> And you tried hard to make those VMAs really separate? E.g. with
> prot_none gaps?

We didn't do that, and in fact I'm probably failing to grasp the underlying
design idea that you have in mind there...hints welcome...

What we did was to hook into the mmap callbacks in the kernel driver, after
userspace mmap'd a region (via a custom allocator API). And we had an ioctl
in there, to connect up other allocation attributes that couldn't be passed
through via mmap. Again, this was for regions of memory that were to be
migrated between CPU and device (GPU).

>
>> So, with only 64K vmas, we quickly ran out, and changed the design to work
>> around that. (And later, the design was *completely* changed to use a separate
>> tracking system altogether). exag
>>
>> The existing limit seems rather too low, at least from my perspective. Maybe
>> it would be better, if expressed as a function of RAM size?
>
> Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> after years when memory grown much more than the code author expected.
> Just look how we scaled hash table sizes... But maybe you can come up
> with something clever. In any case tuning this from the userspace is a
> trivial thing to do and I am somehow skeptical that any early boot code
> would trip over the limit.
>

I agree that this is not a limit that boot code is likely to hit. And maybe
tuning from userspace really is the right approach here, considering that
there is a real cost to going too large.

Just philosophically here, hard limits like this seem a little awkward if they
are set once in, say, 1999 (gross exaggeration here, for effect) and then not
updated to stay with the times, right? In other words, one should not routinely
need to tune most things. That's why I was wondering if something crude and silly
would work, such as just a ratio of RAM to vma count. (I'm more just trying to
understand the "rules" here, than to debate--I don't have a strong opinion
on this.)

The fact that this apparently failed with hash tables is interesting, I'd
love to read more if you have any notes or links. I spotted a 2014 LWN article
( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits
that fixed resizing bugs, such as

12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")

...was it just a storm of bugs that showed up?


thanks,
John Hubbard