Problems with current mlock() implementation

John Ripley (john@empeg.com)
Wed, 27 Oct 1999 16:55:24 +0100


I believe there are various problems with the way mlock() and mlockall()
are handled:

- Locking is intentionally limited (in mm/mlock.c) to at most half of
physical memory. In certain situations it is necessary to lock as much
memory as possible, and there doesn't appear to be any reason why you
can't lock all of memory minus about 128KB.
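
For reference, the check looks roughly like this (paraphrased from
memory rather than quoted, so details may differ between kernel
versions; `locked' is the total locked size in pages):

    /* we may lock at most half of physical memory... */
    /* (this check is pretty bogus, but doesn't hurt) */
    if (locked > num_physpages/2)
            return -ENOMEM;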

- Memory mapped with PROT_NONE is paged in when VM_LOCKED. This causes
problems with glibc's malloc in threaded programs. To quote glibc 2.1
malloc.c:

    /* A memory region aligned to a multiple of HEAP_MAX_SIZE is needed.
       No swap space needs to be reserved for the following large
       mapping (on Linux, this is the case for all non-writable mappings
       anyway). */
    p1 = (char *)MMAP(0, HEAP_MAX_SIZE<<1, PROT_NONE,
                      MAP_PRIVATE|MAP_NORESERVE);

Except that when VM_LOCKED is in effect, the kernel does this (from
mm/mlock.c):

    /* keep track of amount of locked VM */
    pages = (end - start) >> PAGE_SHIFT;
    if (newflags & VM_LOCKED) {
            pages = -pages;
            make_pages_present(start, end);
    }
    vma->vm_mm->locked_vm -= pages;

make_pages_present() pages in the whole 2MB mapping even though it is
PROT_NONE. glibc then unmaps everything not aligned to 1MB and
mprotect(PROT_READ|PROT_WRITE)'s the required amount (about 32KB). If
anything, the step of actually paging memory in and checking
availability should be performed in mprotect(). glibc copes fine with
mprotect() returning ENOMEM.
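
To make the sequence concrete, here is a simplified sketch of the
reservation pattern (my own illustration, not glibc's actual code: the
MMAP macro is expanded to a plain mmap() with MAP_ANONYMOUS, and
HEAP_MAX_SIZE is 1MB as in glibc 2.1):

    #include <stdint.h>
    #include <sys/mman.h>

    #define HEAP_MAX_SIZE (1024 * 1024)

    /* Reserve a HEAP_MAX_SIZE-aligned, inaccessible window, then make
       only the initial portion usable. Under mlockall(MCL_FUTURE) the
       kernel currently faults in the entire 2MB reservation at the
       mmap() below, even though it is PROT_NONE. */
    static char *reserve_heap(size_t initial_size)
    {
            char *p1, *p2;

            /* twice the alignment, so an aligned window must exist */
            p1 = mmap(0, HEAP_MAX_SIZE << 1, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
            if (p1 == MAP_FAILED)
                    return 0;

            /* round up to the first HEAP_MAX_SIZE-aligned address */
            p2 = (char *)(((uintptr_t)p1 + HEAP_MAX_SIZE - 1)
                          & ~(uintptr_t)(HEAP_MAX_SIZE - 1));

            /* give the unaligned head and tail back to the kernel */
            if (p2 > p1)
                    munmap(p1, p2 - p1);
            munmap(p2 + HEAP_MAX_SIZE, HEAP_MAX_SIZE - (p2 - p1));

            /* only now does any of it become accessible - this is
               where paging in and checking availability belongs */
            if (mprotect(p2, initial_size, PROT_READ | PROT_WRITE) != 0) {
                    munmap(p2, HEAP_MAX_SIZE);
                    return 0;
            }
            return p2;
    }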

- page_cache_size is assumed to consist of pages that can always be
paged out, even when they are VM_LOCKED. If a mapped block is
mlock()'d, this gives vm_enough_memory() (in mmap.c) an incorrect
indication of how much memory is actually available. Again, mappings
with PROT_NONE are still paged in.
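
From memory, the heuristic looks something like this (paraphrased, and
it varies between kernel versions):

    int vm_enough_memory(long pages)
    {
            long free;

            if (sysctl_overcommit_memory)
                    return 1;

            free = buffermem >> PAGE_SHIFT;
            free += page_cache_size;    /* counted as reclaimable even
                                           when the pages are locked */
            free += nr_free_pages;
            free += nr_swap_pages;
            return free > pages;
    }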

The upshot of all of this is that using mlockall(MCL_CURRENT |
MCL_FUTURE) is more wasteful of memory than it needs to be. The
libraries and binary are locked as well, and because the count of free
pages is incorrect, more memory can be allocated than is actually
available. This causes segfaults in the same way that overcommit does
when the allocated memory is later accessed.

Possible fixes:

- There seems to be no reason not to allow most of memory to be
locked. The only adverse effect seems to be read() and open() calls
failing with ENOMEM, which is acceptable. Alternatively, about 128KB
or so could be reserved, although that figure is rather arbitrary.

- When VM_LOCKED is in effect for a region, PROT_NONE needs to be
checked for on mlock(), mmap() and mprotect(), with memory being paged
in or released to reflect accessibility.
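
As a sketch (untested, and the guard condition is my own), the
make_pages_present() call in mm/mlock.c could be made conditional on
the access bits:

    /* only fault the pages in if the region is actually accessible */
    if ((newflags & VM_LOCKED) &&
        (newflags & (VM_READ | VM_WRITE | VM_EXEC)))
            make_pages_present(start, end);

mprotect() would then have to do the complementary work, paging memory
in (or failing with ENOMEM) when it makes a VM_LOCKED region
accessible.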

- Fixing the count of pageable blocks is not so simple: VM_LOCKED
blocks are still cacheable, just not pageable. There might need to be
another variable storing a pageable_count.
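
For example (pageable_count is hypothetical, not an existing kernel
variable), the tail of vm_enough_memory() could become:

    /* hypothetical counter, maintained alongside page_cache_size and
       decremented when a cached page joins a VM_LOCKED mapping */
    extern unsigned long pageable_count;

    free = buffermem >> PAGE_SHIFT;
    free += pageable_count;    /* instead of page_cache_size */
    free += nr_free_pages;
    free += nr_swap_pages;
    return free > pages;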

I haven't attached a patch as I'm not entirely sure what the best
approach is, or whether the current behaviour is considered correct. I
have in place a few hacks which seem to work correctly when memory is
exhausted: accesses never cause a segfault if they previously
succeeded as an mmap() or brk().

--
John Ripley, empeg Ltd.
http://www.empeg.com
