Re: [3.6 regression?] THP + migration/compaction livelock (I think)

From: Andy Lutomirski
Date: Tue Nov 13 2012 - 18:25:34 EST


On Tue, Nov 13, 2012 at 3:11 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
>
>> I've seen an odd problem three times in the past two weeks. I suspect
>> a Linux 3.6 regression. I'm on 3.6.3-1.fc17.x86_64. When I run a
>> parallel compilation, it stops making progress. All CPUs are pegged at
>> 100% system time by the respective cc1plus processes. Reading
>> /proc/<pid>/stack shows either
>>
>> [<ffffffff8108e01a>] __cond_resched+0x2a/0x40
>> [<ffffffff8114e432>] isolate_migratepages_range+0xb2/0x620
>> [<ffffffff8114eba4>] compact_zone+0x144/0x410
>> [<ffffffff8114f152>] compact_zone_order+0x82/0xc0
>> [<ffffffff8114f271>] try_to_compact_pages+0xe1/0x130
>> [<ffffffff816143db>] __alloc_pages_direct_compact+0xaa/0x190
>> [<ffffffff81133d26>] __alloc_pages_nodemask+0x526/0x990
>> [<ffffffff81171496>] alloc_pages_vma+0xb6/0x190
>> [<ffffffff81182683>] do_huge_pmd_anonymous_page+0x143/0x340
>> [<ffffffff811549fd>] handle_mm_fault+0x27d/0x320
>> [<ffffffff81620adc>] do_page_fault+0x15c/0x4b0
>> [<ffffffff8161d625>] page_fault+0x25/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> or
>>
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>
> This reminds me of the thread at http://marc.info/?t=135102111800004,
> where Marc's system reportedly went unresponsive much like yours, though
> in his case it also triggered a reboot. If your system is still running (or,
> even better, if you're able to capture this happening in realtime), could
> you try to capture
>
> grep -E "compact_|thp_" /proc/vmstat
>
> as well while it is in progress? (Even if it's not happening right now,
> the data might still be useful if you know it has occurred since the
> last reboot.)
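
(For future occurrences, a capture loop along these lines would automate
that; a rough sketch only: the 10-second interval and the
/tmp/livelock.log path are arbitrary, and reading /proc/<pid>/stack
needs root.)

#!/bin/sh
# Sketch: every 10 seconds, log the compaction/THP vmstat counters and
# the kernel stacks of any cc1plus processes. Run as root.
while sleep 10; do
    date >> /tmp/livelock.log
    grep -E "compact_|thp_" /proc/vmstat >> /tmp/livelock.log
    for pid in $(pgrep cc1plus); do
        echo "stack of pid $pid:" >> /tmp/livelock.log
        cat /proc/$pid/stack >> /tmp/livelock.log
    done
done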

It just happened again.

$ grep -E "compact_|thp_" /proc/vmstat
compact_blocks_moved 8332448774
compact_pages_moved 21831286
compact_pagemigrate_failed 211260
compact_stall 13484
compact_fail 6717
compact_success 6755
thp_fault_alloc 150665
thp_fault_fallback 4270
thp_collapse_alloc 19771
thp_collapse_alloc_failed 2188
thp_split 19600
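
If I am reading the counters right, compact_blocks_moved dwarfs
compact_pages_moved: the migration scanner is churning through pageblocks
while migrating almost nothing. A quick ratio from the values above
(plain arithmetic, nothing more):

$ awk '/^compact_blocks_moved/ { b = $2 }
       /^compact_pages_moved/  { p = $2 }
       END { printf "%.0f pageblocks per migrated page\n", b / p }' /proc/vmstat

With the numbers above that comes out to roughly 382 pageblocks per
migrated page.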


/proc/meminfo:

MemTotal: 16388116 kB
MemFree: 6684372 kB
Buffers: 34960 kB
Cached: 6233588 kB
SwapCached: 29500 kB
Active: 4881396 kB
Inactive: 3824296 kB
Active(anon): 1687576 kB
Inactive(anon): 764852 kB
Active(file): 3193820 kB
Inactive(file): 3059444 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 16777212 kB
SwapFree: 16643864 kB
Dirty: 184 kB
Writeback: 0 kB
AnonPages: 2408692 kB
Mapped: 126964 kB
Shmem: 15272 kB
Slab: 635496 kB
SReclaimable: 528924 kB
SUnreclaim: 106572 kB
KernelStack: 3600 kB
PageTables: 39460 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24971268 kB
Committed_AS: 5688448 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 614952 kB
VmallocChunk: 34359109524 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1050624 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3600384 kB
DirectMap2M: 11038720 kB
DirectMap1G: 1048576 kB
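
Sanity check on THP usage: AnonHugePages is 1050624 kB, i.e.
1050624 / 2048 = 513 transparent huge pages, which lines up (modulo
per-cpu counter drift) with the per-zone nr_anon_transparent_hugepages
totals in the zoneinfo below (313 + 199 = 512). For the record:

$ awk '/^AnonHugePages/ { print $2 / 2048, "huge pages" }' /proc/meminfo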

$ sudo ./perf stat -p 11764 -e compaction:mm_compaction_isolate_migratepages,task-clock,vmscan:mm_vmscan_direct_reclaim_begin,vmscan:mm_vmscan_lru_isolate,vmscan:mm_vmscan_memcg_isolate
[sudo] password for luto:
^C
Performance counter stats for process id '11764':

     1,638,009 compaction:mm_compaction_isolate_migratepages  #    0.716 M/sec          [100.00%]
   2286.993046 task-clock                                     #    0.872 CPUs utilized  [100.00%]
             0 vmscan:mm_vmscan_direct_reclaim_begin          #    0.000 M/sec          [100.00%]
             0 vmscan:mm_vmscan_lru_isolate                   #    0.000 M/sec          [100.00%]
             0 vmscan:mm_vmscan_memcg_isolate                 #    0.000 M/sec

2.623626878 seconds time elapsed
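
So that is on the order of 0.7 million mm_compaction_isolate_migratepages
events per second, with zero direct-reclaim activity: the task is spinning
inside compaction itself, not reclaim. Recording the compaction tracepoints
for a few seconds and dumping them would show the isolation arguments and
return values; a sketch (the 5-second window is arbitrary):

$ sudo ./perf record -e 'compaction:*' -p 11764 sleep 5
$ sudo ./perf script | head -20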

/proc/zoneinfo:
Node 0, zone DMA
  pages free     3972
        min      16
        low      20
        high     24
        scanned  0
        spanned  4080
        present  3911
nr_free_pages 3972
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 4
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 0
nr_written 0
numa_hit 1
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 1
numa_other 0
nr_anon_transparent_hugepages 0
protection: (0, 2434, 16042, 16042)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 8
all_unreclaimable: 1
start_pfn: 16
inactive_ratio: 1
Node 0, zone DMA32
  pages free     321075
        min      2561
        low      3201
        high     3841
        scanned  0
        spanned  1044480
        present  623163
nr_free_pages 321075
nr_inactive_anon 43450
nr_active_anon 203472
nr_inactive_file 5416
nr_active_file 39568
nr_unevictable 0
nr_mlock 0
nr_anon_pages 86455
nr_mapped 156
nr_file_pages 45195
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 6679
nr_slab_unreclaimable 419
nr_page_table_pages 2
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 9994
nr_vmscan_immediate_reclaim 1
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 1
nr_dirtied 1765256
nr_written 1763392
numa_hit 53134489
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 53134489
numa_other 0
nr_anon_transparent_hugepages 313
protection: (0, 0, 13608, 13608)
pagesets
cpu: 0
count: 0
high: 186
batch: 31
vm stats threshold: 48
cpu: 1
count: 4
high: 186
batch: 31
vm stats threshold: 48
cpu: 2
count: 4
high: 186
batch: 31
vm stats threshold: 48
cpu: 3
count: 0
high: 186
batch: 31
vm stats threshold: 48
cpu: 4
count: 4
high: 186
batch: 31
vm stats threshold: 48
cpu: 5
count: 0
high: 186
batch: 31
vm stats threshold: 48
cpu: 6
count: 0
high: 186
batch: 31
vm stats threshold: 48
cpu: 7
count: 11
high: 186
batch: 31
vm stats threshold: 48
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 48
cpu: 9
count: 4
high: 186
batch: 31
vm stats threshold: 48
cpu: 10
count: 13
high: 186
batch: 31
vm stats threshold: 48
cpu: 11
count: 4
high: 186
batch: 31
vm stats threshold: 48
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 4
Node 0, zone Normal
  pages free     1343098
        min      14318
        low      17897
        high     21477
        scanned  0
        spanned  3538944
        present  3483648
nr_free_pages 1343098
nr_inactive_anon 147925
nr_active_anon 221736
nr_inactive_file 759336
nr_active_file 758833
nr_unevictable 0
nr_mlock 0
nr_anon_pages 257074
nr_mapped 31632
nr_file_pages 1529150
nr_dirty 25
nr_writeback 0
nr_slab_reclaimable 125552
nr_slab_unreclaimable 26176
nr_page_table_pages 9844
nr_kernel_stack 456
nr_unstable 0
nr_bounce 0
nr_vmscan_write 36224
nr_vmscan_immediate_reclaim 117
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 3815
nr_dirtied 51415788
nr_written 48993658
numa_hit 1081691700
numa_miss 0
numa_foreign 0
numa_interleave 25195
numa_local 1081691700
numa_other 0
nr_anon_transparent_hugepages 199
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 156
high: 186
batch: 31
vm stats threshold: 64
cpu: 1
count: 177
high: 186
batch: 31
vm stats threshold: 64
cpu: 2
count: 159
high: 186
batch: 31
vm stats threshold: 64
cpu: 3
count: 161
high: 186
batch: 31
vm stats threshold: 64
cpu: 4
count: 146
high: 186
batch: 31
vm stats threshold: 64
cpu: 5
count: 98
high: 186
batch: 31
vm stats threshold: 64
cpu: 6
count: 59
high: 186
batch: 31
vm stats threshold: 64
cpu: 7
count: 54
high: 186
batch: 31
vm stats threshold: 64
cpu: 8
count: 40
high: 186
batch: 31
vm stats threshold: 64
cpu: 9
count: 32
high: 186
batch: 31
vm stats threshold: 64
cpu: 10
count: 46
high: 186
batch: 31
vm stats threshold: 64
cpu: 11
count: 57
high: 186
batch: 31
vm stats threshold: 64
all_unreclaimable: 0
start_pfn: 1048576
inactive_ratio: 11
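
Note that the Normal zone alone has ~1.3M free pages, so this is not
simple memory pressure; presumably the THP faults keep invoking compaction
because the free memory is fragmented below order 9 (a 2MB huge page).
The per-order breakdown would confirm that:

$ cat /proc/buddyinfo   # rightmost columns are the order-9/10 blocks a THP fault needs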


--Andy