Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...

From: Minchan Kim
Date: Mon Jul 23 2018 - 21:03:53 EST


Hi Tino,

Thanks for the report.

On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> Hello,
>
> after enabling the writeback feature in zram, I encountered the kernel bug
> below with heavy swap utilization. There is one specific workload that
> triggers the bug reliably and that is running Windows in KVM while
> overcommitting memory. The Windows VMs would fill all allocated memory with
> zero pages while booting. A few seconds after the host hits zram swap, the
> console on the host is flooded with the bug message. A few more seconds
> later I also encountered filesystem errors on the host causing the root
> filesystem to be mounted read-only. The filesystem errors do not occur when
> leaving RAM available for the host OS by limiting physical memory of the
> QEMU processes via cgroups.
>
> I started three KVM instances with the following commands in my tests. Any
> Windows ISO or disk image can be used. Fewer instances and smaller allocated
> memory will also trigger the bug as long as swapping occurs. The type of
> writeback device does not seem to matter. I have tried a SATA SSD and an
> NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> saw the same behavior on another machine with two CPUs and 128 GB of RAM.
>
> The bug does not occur when using zram as swap without "backing_dev" being
> set, but I had even more severe problems when running the same test on
> Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> or not, the host would eventually lock up entirely when swap is in use on
> zram. The lockups may not be related directly to zram though and were
> apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> either, before the writeback feature was introduced.

v4.18 has not been released yet. Could you tell me which kernel tree and
which version you used?
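
For reference, a minimal sketch of how that information could be collected.
The function names and the tree path argument are only illustrative
assumptions, not something taken from your setup:

```shell
#!/bin/sh
# Print the release string of the running kernel (the same kind of
# string that appears on the "CPU: ... Tainted:" line of the log).
kernel_release() {
    uname -r
}

# Describe the exact commit of a kernel git checkout.
# Assumption: the kernel was built from a git tree whose path is passed
# as $1. Prints "unknown" if the path is not a usable git tree.
kernel_tree_version() {
    git -C "$1" describe --tags --always 2>/dev/null || echo "unknown"
}

kernel_release
```

Calling kernel_tree_version with the path of the source tree used for the
build would additionally show which commit the "+" suffix refers to.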

I don't have enough time to dig into this right now.

Sergey, I would really appreciate it if you could find time to look into this.
If Sergey is not available, I will try to look at it as soon as I can.
No worries.

Thanks.


>
> Thank you for your attention.
>
> --
>
> commands used:
>
> modprobe zram
> echo 1 > /sys/block/zram0/reset
> echo lz4 > /sys/block/zram0/comp_algorithm
> echo /dev/nvme0n1 > /sys/block/zram0/backing_dev
> echo 256G > /sys/block/zram0/disksize
> mkswap /dev/zram0
> swapon /dev/zram0
>
> kvm -nographic -smp 20 -m 131072 -cdrom winpe.iso
>
> --
>
> log message:
>
> BUG: Bad page state in process qemu-system-x86 pfn:3dfab21
> page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
> flags: 0x17fffc000000008(uptodate)
> raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
> raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> bad because of flags: 0x8(uptodate)
> Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc
> pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore intel_rapl_perf
> pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler
> acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit
> crc32c_intel scsi_mod i2c_i801 dca usbcore
> CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G B 4.18.0-rc5+ #1
> Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> Call Trace:
> dump_stack+0x5c/0x7b
> bad_page+0xba/0x120
> get_page_from_freelist+0x1016/0x1250
> __alloc_pages_nodemask+0xfa/0x250
> alloc_pages_vma+0x7c/0x1c0
> do_swap_page+0x347/0x920
> ? __update_load_avg_se.isra.38+0x1eb/0x1f0
> ? cpumask_next_wrap+0x3d/0x60
> __handle_mm_fault+0x7b4/0x1110
> ? update_load_avg+0x5ea/0x720
> handle_mm_fault+0xfc/0x1f0
> __get_user_pages+0x12f/0x690
> get_user_pages_unlocked+0x148/0x1f0
> __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
> try_async_pf+0x87/0x230 [kvm]
> tdp_page_fault+0x132/0x290 [kvm]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> kvm_mmu_page_fault+0x74/0x570 [kvm]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
> kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> ? __update_load_avg_se.isra.38+0x1eb/0x1f0
> ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> ? __switch_to+0x395/0x450
> ? __switch_to+0x395/0x450
> do_vfs_ioctl+0xa2/0x630
> ? __schedule+0x3fd/0x890
> ksys_ioctl+0x70/0x80
> ? exit_to_usermode_loop+0xca/0xf0
> __x64_sys_ioctl+0x16/0x20
> do_syscall_64+0x55/0x100
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fb30361add7
> Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff
> ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff
> 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
> RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0
>
> --
>
> ver_linux: Debian 9.5 with Kernel 4.18.0-rc5+
>
> GNU C 6.3.0
> GNU Make 4.1
> Binutils 2.28
> Util-linux 2.29.2
> Mount 2.29.2
> Module-init-tools 23
> E2fsprogs 1.43.4
> Linux C Library 2.24
> Dynamic linker (ldd) 2.24
> Linux C++ Library 6.0.22
> Procps 3.3.12
> Kbd 2.0.3
> Console-tools 2.0.3
> Sh-utils 8.26
> Udev 232
>
> --
>
> cpuinfo:
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 79
> model name : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> stepping : 1
> microcode : 0xb000021
> cpu MHz : 1200.632
> cache size : 25600 KB
> physical id : 0
> siblings : 20
> core id : 0
> cpu cores : 10
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 20
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
> tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
> cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc
> cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
> bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> bogomips : 4400.00
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> --
> Kind regards,
>
> Tino Lehnig