RE: [PATCH v2 0/3] support for broken memory modules (BadRAM)

From: Luck, Tony
Date: Fri Jun 24 2011 - 12:46:39 EST


> > I am very curious about your findings. Independently of those, I am in
> > favour of a patch that enables longer e820 tables if it has no further
> > impact on speed or space.
> >
>
> That is already in the mainline kernel, although only if fed from the
> boot loader (it was developed in the context of mega-NUMA machines); the
> stub fetching from INT 15h doesn't use this at the moment.

Does it scale? Current x86 systems go up to about 2TB - presumably
in the form of 256 8GB DIMMs (or maybe 512 4GB ones). If a faulty
row or column on a DIMM can give rise to about 4096 bad pages, then
these large systems could conceivably have 1-2 million bad pages
(while still being quite usable - a loss of 4-8GB from a 2TB system
is down in the noise). Can we handle a 2 million entry e820 table?
Do we want to?
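
For reference, the arithmetic behind those numbers can be checked with
a rough sketch like the one below. The function name is made up for
illustration, and the 20-byte entry size is an assumption based on the
classic e820 entry layout (64-bit base, 64-bit length, 32-bit type);
the worst case assumes one faulty row or column on every DIMM.

```python
PAGE_SIZE = 4096           # bytes per x86 page
BAD_PAGES_PER_DIMM = 4096  # one faulty row or column, per the estimate above
E820_ENTRY_BYTES = 20      # assumption: 64-bit base + 64-bit length + 32-bit type


def worst_case(dimm_count):
    """Return (bad_pages, lost_bytes, table_bytes) if every DIMM has one fault."""
    bad_pages = dimm_count * BAD_PAGES_PER_DIMM
    lost_bytes = bad_pages * PAGE_SIZE
    # One entry per bad page. If the bad pages are not adjacent, the usable
    # ranges in between would need entries too, so this is a lower bound.
    table_bytes = bad_pages * E820_ENTRY_BYTES
    return bad_pages, lost_bytes, table_bytes


for dimms in (256, 512):
    pages, lost, table = worst_case(dimms)
    print(f"{dimms} DIMMs: {pages} bad pages, "
          f"{lost / 2**30:.0f} GiB lost, {table / 2**20:.0f} MiB of entries")
```

Under these assumptions the table alone would take tens of megabytes,
before counting the usable ranges between the bad pages.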

Perhaps we may end up with a composite solution. Use e820 to map out
the bad pages below some limit (like 4GB), preferably in the boot loader
so it can find a range of good memory to load the kernel. Then use
BadRAM patterns for addresses over 4GB, so that Linux avoids the bad
pages by flagging their page structures.
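
A minimal sketch of that split, not taken from the patch set: it
assumes a BadRAM pattern is an (address, mask) pair and that an address
is bad when it agrees with the pattern address on every bit set in the
mask. All function names and the `limit` parameter are hypothetical.

```python
PAGE_SIZE = 4096
LIMIT = 4 << 30  # the 4GB boundary suggested above


def matches(addr, base, mask):
    """Assumed BadRAM rule: addr is bad if it equals base on all mask bits."""
    return (addr ^ base) & mask == 0


def low_reserved_ranges(patterns, limit=LIMIT):
    """Expand patterns below `limit` into merged (start, length) ranges,
    the kind of list a boot loader could turn into reserved e820 entries."""
    ranges = []
    for addr in range(0, limit, PAGE_SIZE):
        if any(matches(addr, base, mask) for base, mask in patterns):
            if ranges and ranges[-1][0] + ranges[-1][1] == addr:
                # Adjacent to the previous bad page: grow that range.
                start, length = ranges[-1]
                ranges[-1] = (start, length + PAGE_SIZE)
            else:
                ranges.append((addr, PAGE_SIZE))
    return ranges


def high_page_is_bad(addr, patterns, limit=LIMIT):
    """At or above `limit`, test the pattern directly instead of
    storing one table entry per bad page."""
    return addr >= limit and any(
        matches(addr, base, mask) for base, mask in patterns)
```

The point of the split is that the table only grows with the number of
bad pages below the limit, while the cost above it depends on the
number of patterns rather than the number of pages they cover.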

-Tony