Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline

From: HORIGUCHI NAOYA(堀口 直也)
Date: Mon Jun 15 2020 - 02:20:01 EST


Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
>
> We faced similar problems with hwpoisoned pages
> on one of our production clusters after a kernel update to stable 4.19.
> An application that does a lot of memory allocations sometimes caught a SIGBUS
> signal, with a message in dmesg about a hardware memory corruption fault.
> In the kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. These events always happened before the application
> was killed. This is not the behavior we expect: we want our application to
> continue working on a smaller set of available pages in the system.
>
> This issue is difficult to reproduce, but we suspect the reason for this
> behavior is that compaction does not check whether free pages are poisoned
> while processing them, so valid userspace data gets migrated to bad pages.
> We wrote a simple test:
> - soft offline the first 4 pages of every 64 contiguous pages in ZONE_NORMAL
>   by writing the pfn to /sys/devices/system/memory/soft_offline_page
> - force compaction with echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series, after these steps bash became unusable
> and every attempt to run any command led to SIGBUS with a message about
> a hardware memory corruption fault. After applying this series to our kernel
> tree, we can no longer reproduce these SIGBUSes with our test. On upstream
> kernel 5.7 this behavior is still reproducible.
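A minimal sketch of the two test steps described above, for illustration only. Assumptions not stated in the original: 4 KiB base pages, 64-page chunks aligned to a 64-PFN boundary, and ZONE_NORMAL's PFN bounds supplied by hand (e.g. read from /proc/zoneinfo). Note that the soft_offline_page sysfs file expects a physical address, so the sketch writes pfn * PAGE_SIZE. The helper names are hypothetical, and running it poisons real memory, so it belongs on a disposable test machine.

```python
import sys
from typing import Iterator

# Assumption: 4 KiB base pages (x86_64 default); check with `getconf PAGESIZE`.
PAGE_SIZE = 4096
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
COMPACT = "/proc/sys/vm/compact_memory"


def pfns_to_offline(zone_start_pfn: int, zone_end_pfn: int,
                    block: int = 64, per_block: int = 4) -> Iterator[int]:
    """Yield the first `per_block` PFNs of every `block`-aligned chunk
    inside [zone_start_pfn, zone_end_pfn)."""
    # Round the start up to a block boundary (alignment is an assumption).
    first = -(-zone_start_pfn // block) * block
    for base in range(first, zone_end_pfn, block):
        yield from range(base, min(base + per_block, zone_end_pfn))


def soft_offline(pfn: int, path: str = SOFT_OFFLINE) -> bool:
    """Ask the kernel to soft-offline one page; return False if the write failed."""
    try:
        with open(path, "w") as f:
            # The sysfs file takes a physical address, not a raw PFN.
            f.write(hex(pfn * PAGE_SIZE))
        return True
    except OSError:
        # The kernel rejected the request (or the file is unavailable).
        return False


def compact_all(path: str = COMPACT) -> None:
    """Trigger compaction of all zones, like `echo 1 > compact_memory`."""
    with open(path, "w") as f:
        f.write("1")


if __name__ == "__main__":
    # Usage (as root): python3 repro.py ZONE_START_PFN ZONE_END_PFN
    start, end = (int(x, 0) for x in sys.argv[1:3])
    ok = sum(soft_offline(p) for p in pfns_to_offline(start, end))
    print(f"soft-offlined {ok} pages; triggering compaction")
    compact_all()
```

Counting successful writes separately from the PFN generator makes it easy to see how many pages the kernel actually accepted before compaction runs.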
>
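The suspected failure mode (compaction not checking whether free pages are poisoned) can be illustrated with a toy model. This is not kernel code: the Page type and the free_targets/migrate helpers are hypothetical, scanning order is simplified to low-to-high PFNs, and it assumes, as the hypothesis implies, that a soft-offlined page can remain on the free lists with its poison flag set.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    pfn: int
    free: bool
    hwpoison: bool = False  # soft-offlined, but still sitting on the free lists


def free_targets(pages: List[Page], check_poison: bool) -> List[int]:
    """PFNs a compaction-like free scanner would hand out as migration targets."""
    return [p.pfn for p in pages
            if p.free and not (check_poison and p.hwpoison)]


def migrate(sources: List[int], targets: List[int]) -> Dict[int, int]:
    """Map each movable source page to the next available target page."""
    return dict(zip(sources, targets))


def demo_pages() -> List[Page]:
    # PFNs 0-3 hold user data; 4-5 are free but poisoned; 6-7 are free and healthy.
    return ([Page(pfn, free=False) for pfn in range(4)]
            + [Page(pfn, free=True, hwpoison=True) for pfn in (4, 5)]
            + [Page(pfn, free=True) for pfn in (6, 7)])
```

Without the check, migrate([0, 1], free_targets(demo_pages(), False)) moves user data onto poisoned PFNs 4 and 5; with check_poison=True the same call picks the healthy PFNs 6 and 7.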
> So, we would like to know why this patchset wasn't merged upstream.
> Are there any problems with such a rework of {soft,hard}-offline handling?

There's no technical reason; I just didn't have enough capacity to push
this through to being merged. Really sorry about that.

> BTW, this patchset needs to be updated for the latest upstream changes in mm.

I'm working on this now. It still needs more testing to confirm, but I hope
to post an updated version for 5.9.

Thanks,
Naoya Horiguchi