Re: SMP kernel lockup, 2.2.14 and 2.2.15pre15

From: Patrick J. LoPresti (patl@cag.lcs.mit.edu)
Date: Fri Mar 31 2000 - 17:37:01 EST


I have finally reproduced my lockup on 2.2.14 with the IKD patches.
Here are the backtraces (sans arguments) for the two CPUs as reported
by kdb.

Backtrace for CPU 1:

  stext_lock + 0x5bb
  __wait_on_buffer + 0xd9
  sync_block + 0x9f
  sync_direct + 0x22
  ext2_sync_file + 0x4b
  sys_fsync + 0x85

Backtrace for CPU 0:

  add_timer + 0x3a
  tcp_send_delayed_ack + 0x34
  tcp_delack_timer + 0x3a
  timer_bh + 0x37a
  do_bottom_half + 0x89
  do_IRQ + 0x52
  common_interrupt + 0x18
  do_no_page + 0x42
  handle_mm_fault + 0x107
  do_page_fault + 0x12d
  error_code + 0x2d
  memcpy_toiovec + 0x38
  tcp_recvmsg + 0x377
  inet_recvmsg + 0x72
  sock_recvmsg + 0x37
  sock_read + 0x82
  sys_read + 0xc8

So one process is calling fsync() and the other is calling read() on
a TCP socket. It is not obvious to me why this is deadlocked.

When I do "go" and then hit Pause again, CPU 1 is always stuck at
exactly the same place. CPU 0 is also exactly the same except for the
most recent 5 or 6 frames; it seems like I always catch it while
handling the interrupt and attempting to send the delayed ack, which
then sets itself up to fire again a little later.
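
For anyone who wants to repeat this comparison mechanically, here is a
minimal sketch in Python. It assumes the kdb output has already been
reduced to "symbol + 0xoffset" lines like the ones above. The helper
names (parse_backtrace, stable_outer_frames) are hypothetical and are
not part of kdb or IKD.

```python
import re

# Matches frames written as "symbol + 0xoffset", e.g. "tcp_recvmsg + 0x377".
FRAME_RE = re.compile(r"^\s*(\w+)\s*\+\s*0x([0-9a-fA-F]+)\s*$")


def parse_backtrace(text):
    """Return [(symbol, offset), ...] with the innermost frame first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:  # silently skip blank lines and anything that is not a frame
            frames.append((m.group(1), int(m.group(2), 16)))
    return frames


def stable_outer_frames(sample_a, sample_b):
    """Return the outermost frames that both samples have in common.

    Walks both traces from the outermost frame inward and stops at the
    first frame where they differ.
    """
    common = []
    for fa, fb in zip(reversed(sample_a), reversed(sample_b)):
        if fa != fb:
            break
        common.append(fa)
    common.reverse()  # restore innermost-first order
    return common


# The CPU 0 backtrace exactly as posted above.
CPU0_TRACE = """
  add_timer + 0x3a
  tcp_send_delayed_ack + 0x34
  tcp_delack_timer + 0x3a
  timer_bh + 0x37a
  do_bottom_half + 0x89
  do_IRQ + 0x52
  common_interrupt + 0x18
  do_no_page + 0x42
  handle_mm_fault + 0x107
  do_page_fault + 0x12d
  error_code + 0x2d
  memcpy_toiovec + 0x38
  tcp_recvmsg + 0x377
  inet_recvmsg + 0x72
  sock_recvmsg + 0x37
  sock_read + 0x82
  sys_read + 0xc8
"""

cpu0_frames = parse_backtrace(CPU0_TRACE)
```

Passing two parsed samples from the same CPU to stable_outer_frames()
returns the part of the stack that did not change between them, which
is the part worth staring at.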

Note that "do_no_page + 0x42" is the instruction immediately following
a call to do_anonymous_page. I suspect do_anonymous_page is where I
am stuck, and the backtrace is being confused by the presence of the
interrupt. But I am not sure.

I am hoping a wizard can just look at these backtraces and see the
problem. Failing that, I would appreciate ideas for what to try next.

This crash is not easy to reproduce; this time it took almost a week
of continuously running the offending operations. The program which
elicits the crash is (unfortunately) commercial, so I do not have the
source. It runs entirely as a regular user, however, so this is
definitely a kernel bug.

I would be glad to provide any additional information (e.g., snippets
of disassembly) which would be useful.

Help, please?

 - Pat



