Re: [discuss] Re: 32-bit dma allocations on 64-bit platforms

From: William Lee Irwin III
Date: Fri Jun 25 2004 - 01:18:15 EST


/*
On Thu, Jun 24, 2004 at 04:09:45PM -0700, Andrew Morton wrote:
>> A testcase would be, on a 2G box:
>> a) free up as much memory as you can
>> b) write a 1.2G file to fill highmem with pagecache
>> c) malloc(800M), bzero(), sleep
>> d) swapoff -a
>> You now have a box which has almost all of lowmem pinned in anonymous
>> memory. It'll limp along and go oom fairly easily.
>> Another testcase would be:
>> a) free up as much memory as you can
>> b) write a 1.2G file to fill highmem with pagecache
>> c) malloc(800M), mlock it
>> You now have most of lowmem mlocked.

On Thu, Jun 24, 2004 at 04:15:49PM -0700, William Lee Irwin III wrote:
> These are approximately identical to the testcases I had in mind, except
> neither of these is truly specific to 2GB and can have the various magic
> numbers calculated from sysconf() and/or meminfo.

It seems that glibc is fucking with sysinfo or something; the hackish
workaround was to call sysconf(_SC_PAGESIZE) by hand wherever mem_unit
would otherwise be needed, and to treat the screwed-with sysinfo fields
as being in opaque units. Blame Uli.
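
For comparison, here is a minimal sketch of the documented calculation,
assuming the sysinfo fields can be trusted: freeram and freehigh are
counted in units of mem_unit bytes, so free lowmem is their difference
scaled by mem_unit. The helper name free_low_bytes() is hypothetical and
not part of the testcase, and treating a zero mem_unit as 1 is a
defensive assumption.

```c
#include <sys/sysinfo.h>

// Hypothetical helper: free lowmem in bytes, from a filled-in struct
// sysinfo. freeram and freehigh are multiples of mem_unit bytes.
// Assumption: a mem_unit of 0 is treated as 1 rather than trusted.
unsigned long long free_low_bytes(const struct sysinfo *si)
{
	unsigned long long unit = si->mem_unit ? si->mem_unit : 1;

	return (unsigned long long)(si->freeram - si->freehigh) * unit;
}
```

A caller would fill the structure with sysinfo(&info) and pass &info in.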

At any rate, running this with no swap online just results in OOM kills
whenever enough lowmem is needed. This is expected: the anonymous
allocations aren't mlocked, so with swap online they would merely be
swapped out, and with swap offline the nr_swap_pages deadlock is no
longer possible (the nr_swap_pages fix wasn't in place for this
testing). Something more sophisticated may have worse effects.

However, there were apparent oddities: premature failures of vma
allocations and piss poor vma merging. For instance, the sbrk()/mmap()
changeover logic falls back on a per-iteration basis largely because
sticking to mmap() and then changing over to sbrk() for good once it
fails switched over prematurely, and so failed to sufficiently utilize
lowmem. The failures to find free areas for the vmas went away after
alternating between sbrk() and mmap(). Also, the 64KB mmap()'s of the
file aren't merged at all, despite being blatantly sequential. I'll
look into this.
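
The per-iteration fallback can be sketched in isolation as below. This
is an illustration only: grab_chunk() is a hypothetical name, and the
use of MAP_PRIVATE|MAP_ANONYMOUS with fd -1 is an assumption about how
the anonymous mapping would normally be requested on Linux.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: get len bytes of anonymous memory, trying mmap()
// first and falling back to sbrk() for this one call only, so that one
// mmap() failure does not move every later allocation over to the heap.
// Returns NULL when both fail.
void *grab_chunk(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		return p;
	p = sbrk((intptr_t)len);	// sbrk() reports failure as (void *)-1
	return p == (void *)-1 ? NULL : p;
}
```

Because the choice is made afresh on every call, address space that
mmap() cannot find in one iteration does not stop it from being tried
again in the next.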

The strategy of mmap()'ing locked pagecache is useless for PAE boxen in
general and so things should be taught to, say, mount ramfs, allocate
ramfs pagecache to fill highmem, and then go on to mmap() instead of
fiddling around mmap()'ing and mlock()'ing pagecache. I can implement
this if it's deemed necessary to have the testcase extensible to PAE.

The results are mixed. It's not clear that this behavior is
pathological, at least not in the manner Andrea described. It is,
however, easy to trigger workload failure as opposed to kernel deadlock.
It may help to clarify the general position on that kind of issue so I
know how and whether that should be addressed.

$ cat /proc/meminfo
MemTotal:      1032988 kB
MemFree:        106684 kB
Buffers:          3804 kB
Cached:          16256 kB
SwapCached:          0 kB
Active:         897104 kB
Inactive:         2708 kB
HighTotal:      130816 kB
HighFree:       101388 kB
LowTotal:       902172 kB
LowFree:          5296 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:             108 kB
Writeback:           0 kB
Mapped:         881912 kB
Slab:            18276 kB
Committed_AS:   911496 kB
PageTables:       1896 kB
VmallocTotal:   114680 kB
VmallocUsed:      2160 kB
VmallocChunk:   105244 kB
$ cat /proc/buddyinfo
Node 0, zone      DMA      0      0      1      1      1      0      1      1      1      0      0
Node 0, zone   Normal     56     14     59      2      3      0      1      1      1      0      0
Node 0, zone  HighMem    777    315    349    360    505    236     61      1      0      0      0
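
As a cross-check on the output above, here is a minimal sketch (not part
of the testcase) that totals one /proc/buddyinfo line, where column k
counts the free blocks of 2^k pages. The helper name buddy_free_pages()
is hypothetical. Assuming 4 kB pages, the HighMem line sums to 25347
pages, i.e. 101388 kB, which agrees with HighFree.

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: sum the free pages described by one buddyinfo
// line. Column k (counting from 0) is the number of free 2^k-page blocks.
// Returns -1 if the line does not start like a buddyinfo line.
long buddy_free_pages(const char *line)
{
	int node, off = 0;
	char zone[32];
	long pages = 0;
	unsigned order = 0;

	if (sscanf(line, "Node %d, zone %31s %n", &node, zone, &off) < 2 || !off)
		return -1;
	for (const char *s = line + off; ; order++) {
		char *end;
		unsigned long blocks = strtoul(s, &end, 10);

		if (end == s)	// no more numbers on this line
			break;
		pages += (long)(blocks << order);
		s = end;
	}
	return pages;
}
```

The DMA and Normal lines total 476 and 832 pages respectively.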
*/

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64

#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>

#define LENGTH_STEP ((off64_t)pagesize << 4)
#define MAX_RETRIES 64
#ifdef DEBUG
#define dprintf(fmt, arg...) printf(fmt,##arg)
#else
#define dprintf(fmt, arg...) do { } while (0)
#endif

#define die() \
do { \
	fprintf(stderr, "failure %s (%d) at %s:%d\n", \
		strerror(errno), errno, __FILE__, __LINE__); \
	fflush(stderr); \
	sleep(60); \
	exit(EXIT_FAILURE); \
} while (0)

int main(void)
{
	struct sysinfo info;
	char namebuf[64] = "/tmp/zoneDoS_XXXXXX";
	int i, fd, retries;
	off64_t len = 0;
	unsigned long *first, *last, *p, *first_buf, *last_buf, *q;
	unsigned long freehigh, freelow;
	long pagesize;

	first = last = NULL;
	first_buf = last_buf = NULL;
	if ((pagesize = sysconf(_SC_PAGESIZE)) < 0)
		die();
	/* Backing file for the pagecache; unlinked at once, lives until close(). */
	if ((fd = mkstemp(namebuf)) < 0)
		die();
	if (unlink(namebuf))
		die();

	if (sysinfo(&info))
		die();
	/*
	 * Phase 1: fill highmem with mlocked pagecache, LENGTH_STEP bytes at
	 * a time, until freehigh hits zero or stops dropping for MAX_RETRIES
	 * iterations in a row.  The first word of each mapping points to the
	 * next mapping so they can all be unmapped later.
	 */
	retries = freehigh = 0;
	while (info.freehigh && retries < MAX_RETRIES) {
		if (ftruncate64(fd, len + LENGTH_STEP))
			die();
		p = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
				MAP_SHARED, fd, len);
		if (p == MAP_FAILED)
			die();
		len += LENGTH_STEP;
		if (mlock(p, LENGTH_STEP))
			die();
		*p = 0;
		if (last)
			*last = (unsigned long)p;
		last = p;
		if (!first)
			first = p;
		freehigh = info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freehigh >= freehigh)
			retries++;
		else
			retries = 0;
		dprintf("allocated %lu kB, freehigh = %lu kB\n",
			(unsigned long)(len >> 10),
			(unsigned long)(info.freehigh >> 10));
	}

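	if (sysinfo(&info))
		die();
	/*
	 * Phase 2: eat lowmem (freeram - freehigh) with anonymous memory.
	 * mmap() is tried first on every iteration with sbrk() as the
	 * fallback; sbrk() reports failure as (void *)-1, which is the same
	 * value as MAP_FAILED.
	 */
	retries = freelow = 0;
	while (info.freeram - info.freehigh && retries < MAX_RETRIES) {
		/* Linux wants MAP_PRIVATE or MAP_SHARED; fd is -1 for anonymous maps. */
		q = mmap(NULL, LENGTH_STEP, PROT_READ|PROT_WRITE,
				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		if (q == MAP_FAILED)
			q = sbrk(LENGTH_STEP);
		if (q == MAP_FAILED) {
			sleep(1);
			++retries;
			continue;
		}
		/* Touch every page so it is actually backed by memory. */
		for (i = 0; i < LENGTH_STEP/sizeof(*q); i += pagesize/sizeof(*q))
			q[i + 1] = 1;
		*q = 0;
		if (last_buf)
			*last_buf = (unsigned long)q;
		last_buf = q;
		if (!first_buf)
			first_buf = q;
		freelow = info.freeram - info.freehigh;
		if (sysinfo(&info))
			die();
		if (info.freeram - info.freehigh >= freelow)
			++retries;
		else
			retries = 0;
		dprintf("freelow = %lu kB\n", (info.freeram - info.freehigh) >> 10);
	}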

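	/*
	 * Phase 3: unmap (and thereby unlock) the pagecache, keep the
	 * anonymous memory in place, and wait to be killed.
	 */
	dprintf("done allocating anonymous memory, freeing pagecache\n");
	while (first) {
		p = first;
		first = (unsigned long *)(*first);
		if (munmap(p, LENGTH_STEP))
			die();
	}
	close(fd);
	pause();
	return EXIT_SUCCESS;
}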