I have a blaze of 353 page allocation failures, all alike

From: Peter Kruse
Date: Thu Feb 10 2011 - 10:03:37 EST


Hello,

today one of our servers went berserk and produced literally 353
page allocation failures in 7 minutes until it was reset
(sysrq was still working). I attach one of them as an example.
The failures happened for different processes ranging from
sshd, top, java, tclsh, ypserv, smbd, portmap, kswapd to Xvnc4.
I already reported about an incidence with this server here:
https://lkml.org/lkml/2011/1/19/145
we have set vm.min_free_kbytes = 2097152 but the problem
obviously did not go away.
All traces start with one of these three beginnings:

Call Trace:
<IRQ> [<ffffffff81071f46>] __alloc_pages_nodemask+0x5ca/0x600
[<ffffffff81344127>] ? skb_dma_map+0xd2/0x23f

Call Trace:
<IRQ> [<ffffffff81071f46>] __alloc_pages_nodemask+0x5ca/0x600
[<ffffffff8109428b>] kmem_getpages+0x5c/0x127

Call Trace:
<IRQ> [<ffffffff81071f46>] __alloc_pages_nodemask+0x5ca/0x600
[<ffffffffa01418fd>] ? tcp_packet+0xc87/0xcb2 [nf_conntrack]

Please anybody, what is the cause of these failures?

Thanks,

Peter Call Trace:
<IRQ> [<ffffffff81071f46>] __alloc_pages_nodemask+0x5ca/0x600
[<ffffffff8109428b>] kmem_getpages+0x5c/0x127
[<ffffffff81094475>] fallback_alloc+0x11f/0x195
[<ffffffff81094614>] ____cache_alloc_node+0x129/0x138
[<ffffffff81094fdd>] kmem_cache_alloc+0xd1/0xfe
[<ffffffff8133c2f9>] sk_prot_alloc+0x2c/0xcd
[<ffffffff8133c427>] sk_clone+0x1b/0x24b
[<ffffffff81369ce2>] inet_csk_clone+0x13/0x81
[<ffffffff8137d698>] tcp_create_openreq_child+0x1d/0x39c
[<ffffffff8137c309>] tcp_v4_syn_recv_sock+0x57/0x1bc
[<ffffffff8137d50f>] tcp_check_req+0x210/0x37c
[<ffffffffa0154423>] ? ipv4_confirm+0x161/0x179 [nf_conntrack_ipv4]
[<ffffffff8137ba63>] tcp_v4_do_rcv+0xc1/0x1d7
[<ffffffff8137c021>] tcp_v4_rcv+0x4a8/0x739
[<ffffffff8135ba27>] ? nf_hook_slow+0x63/0xc3
[<ffffffff81361bb0>] ? ip_local_deliver_finish+0x0/0x1d0
[<ffffffff81361ca8>] ip_local_deliver_finish+0xf8/0x1d0
[<ffffffff81361df2>] ip_local_deliver+0x72/0x7a
[<ffffffff813618ac>] ip_rcv_finish+0x33c/0x356
[<ffffffff81361b79>] ip_rcv+0x2b3/0x2ea
[<ffffffff813a2861>] ? packet_rcv_spkt+0x10f/0x11a
[<ffffffff8134660a>] netif_receive_skb+0x2cb/0x2ed
[<ffffffff81346767>] napi_skb_finish+0x28/0x40
[<ffffffff81346ba5>] napi_gro_receive+0x2a/0x2f
[<ffffffffa001669d>] igb_poll+0x507/0x86a [igb]
[<ffffffffa0015ef8>] ? igb_clean_tx_irq+0x1dd/0x47b [igb]
[<ffffffff81346cb6>] net_rx_action+0xa7/0x178
[<ffffffff8103bd21>] __do_softirq+0x96/0x119
[<ffffffff8100bf5c>] call_softirq+0x1c/0x28
[<ffffffff8100d9e7>] do_softirq+0x33/0x6b
[<ffffffff8103b844>] irq_exit+0x36/0x38
[<ffffffff8100d0e9>] do_IRQ+0xa3/0xba
[<ffffffff8100b7d3>] ret_from_intr+0x0/0xa
<EOI> [<ffffffffa00f046f>] ? xfs_reclaim_inode_shrink+0xc3/0x112 [xfs]
[<ffffffffa00f0451>] ? xfs_reclaim_inode_shrink+0xa5/0x112 [xfs]
[<ffffffffa00f04bd>] ? xfs_reclaim_inode_shrink+0x111/0x112 [xfs]
[<ffffffff810770fc>] ? shrink_slab+0xd2/0x154
[<ffffffff81077e00>] ? try_to_free_pages+0x221/0x31c
[<ffffffff81074f4a>] ? isolate_pages_global+0x0/0x1f0
[<ffffffff81071d79>] ? __alloc_pages_nodemask+0x3fd/0x600
[<ffffffff8109428b>] ? kmem_getpages+0x5c/0x127
[<ffffffff81094475>] ? fallback_alloc+0x11f/0x195
[<ffffffff81094614>] ? ____cache_alloc_node+0x129/0x138
[<ffffffff810a9055>] ? pollwake+0x0/0x5b
[<ffffffff810946bf>] ? kmem_cache_alloc_node+0x9c/0xc7
[<ffffffff8109472d>] ? __kmalloc_node+0x43/0x45
[<ffffffff81340625>] ? __alloc_skb+0x6b/0x164
[<ffffffff8133bcc1>] ? sock_alloc_send_pskb+0xdd/0x31c
[<ffffffff8133bf10>] ? sock_alloc_send_skb+0x10/0x12
[<ffffffff8139e4c2>] ? unix_stream_sendmsg+0x180/0x312
[<ffffffff81338270>] ? sock_aio_write+0x109/0x122
[<ffffffff8100b7ce>] ? common_interrupt+0xe/0x13
[<ffffffff8109a41a>] ? do_sync_write+0xe7/0x12d
[<ffffffff81049208>] ? autoremove_wake_function+0x0/0x38
[<ffffffff8100b7ce>] ? common_intreclaimable:78357
mapped:11679 shmem:26799 pagetables:13497 bounce:0