Re: Fwd: Persistent rt_sigreturn segfaults on KVM VMs after upgrade to 5.15

From: Sean Christopherson
Date: Thu May 18 2023 - 11:02:58 EST


On Thu, May 18, 2023, Bagas Sanjaya wrote:
> On 5/18/23 20:57, Bagas Sanjaya wrote:
> > Hi,
> >
> > I noticed a regression report on Bugzilla [1]. Quoting from it:
> >
> >> I'm experiencing sporadic but persistent segmentation faults on the KVM
> >> VMs I manage. These faults began appearing after upgrading from Linux
> >> Kernel 4.x to 5.15.59. I further upgraded to 5.15.91 and transitioned the
> >> userspace from Debian 10 (buster) to Debian 11 (bullseye), yet the issues
> >> persist. Notably, the libc has also changed in the process as seen in the
> >> following error logs:

Was the host or guest kernel upgraded? If the guest kernel was upgraded, it's
unlikely, though still possible, that this is a KVM bug.

> >> post.sh[21952]: bad frame in rt_sigreturn frame:000072db65961bb8
> >> ip:6c25f82a9a5d sp:72db65962168 orax:ffffffffffffffff in
> >> libc-2.28.so[6c25f8294000+147000]
> >>
> >> cron[7626]: bad frame in rt_sigreturn frame:000073ddebeb6ff8
> >> ip:72ad9f44d594 sp:73ddebeb75a8 orax:ffffffffffffffff in
> >> libc-2.28.so[72ad9f3a9000+147000]
> >>
> >> cron[64687]: bad frame in rt_sigreturn frame:000073265764b038
> >> ip:67c7b5a0f14a sp:73265764b5f0 orax:ffffffffffffffff in
> >> libc-2.31.so[67c7b596f000+159000]
> >>
> >> worker.py[54568]: bad frame in rt_sigreturn frame:000078eef6591cf8
> >> ip:6c9f9b2a604e sp:78eef6592298 orax:ffffffffffffffff in
> >> libpthread-2.31.so[6c9f9b29a000+10000]
> >>
> >>
> >> The segmentation faults occur 1-3 times daily across approximately 1000
> >> VMs running on hundreds of (Supermicro, Intel CPU) bare-metal servers.
> >> Currently, there's no reliable way for me to reproduce the issue. I
> >> initially considered this bug -
> >> https://www.spinics.net/lists/linux-tip-commits/msg61293.html - as a
> >> possible cause, but judging from the comments it likely isn't.
> >>
> >> The best approximation to a reproducer I have is a Python script that
> >> spawns several child processes and continuously sends them a SIGUSR1
> >> signal. Still, it takes a few hours to trigger the issue even when running
> >> this script on several hundred VMs.
> >>
> >> Switching to the 6.x kernel isn't immediately feasible as these are
> >> production systems with specific requirements. The transition is planned
> >> but will likely take several months.
> >>
> >> I'm looking for suggestions on how to more reliably reproduce this
> >> problem. Then I could try different old and new kernels and maybe narrow
> >> it down.
> >
> > See bugzilla for the full thread.
> >
> > Anyway, I'm adding it to regzbot:
> >
> > #regzbot introduced: v4.19..v5.15 https://bugzilla.kernel.org/show_bug.cgi?id=217457
> > #regzbot title: bad frame in rt_sigreturn (libc-related?) regression after 5.15 upgrade
> >
>
> Oops, I forgot to add the reporter:
>
> #regzbot from: Theodor Milkov <tm@xxxxxx>
>
> Sorry for the inconvenience.
>
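
For anyone who wants to try reproducing this, the reproducer described in the
report (several child processes that are continuously sent SIGUSR1) could be
sketched roughly as below. This is not the reporter's script. The function
names, the child count, the 1 ms pacing between rounds, and the use of SIGKILL
for teardown are all assumptions made for illustration. Each delivered signal
makes the child run a handler and return through rt_sigreturn, which is the
path the "bad frame" messages point at.

```python
import os
import signal
import time


def _child_loop():
    """Child: handle SIGUSR1 forever; every delivery ends in rt_sigreturn."""
    hits = 0

    def on_usr1(signum, frame):
        nonlocal hits
        hits += 1

    signal.signal(signal.SIGUSR1, on_usr1)
    # SIGUSR1 was blocked before fork(); unblock only once the handler exists.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
    while True:
        sum(range(1000))  # stay busy in user space between signals


def run_signal_storm(num_children=4, duration_s=1.0):
    """Fork children, send them SIGUSR1 for duration_s seconds, then reap them.

    Returns (signals_sent, unexpected), where unexpected maps pid -> raw wait
    status for every child that did not die from our own SIGKILL.
    """
    # Block SIGUSR1 across fork() so a child cannot be killed by an early
    # signal that arrives before it has installed its handler.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    pids = []
    try:
        for _ in range(num_children):
            pid = os.fork()
            if pid == 0:
                try:
                    _child_loop()
                finally:
                    os._exit(0)  # never fall back into the parent's code
            pids.append(pid)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

    sent = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for pid in pids:
            os.kill(pid, signal.SIGUSR1)
            sent += 1
        time.sleep(0.001)  # assumed pacing; the report gives no rate

    unexpected = {}
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)
        killed_by_us = (os.WIFSIGNALED(status)
                        and os.WTERMSIG(status) == signal.SIGKILL)
        if not killed_by_us:
            unexpected[pid] = status  # e.g. a child that died from SIGSEGV
    return sent, unexpected


if __name__ == "__main__":
    sent, unexpected = run_signal_storm(num_children=8, duration_s=60.0)
    print(f"sent {sent} signals, unexpected child exits: {unexpected}")
```

Given that the report says it takes hours across several hundred VMs to hit
the problem, a single short run like this coming back clean says very little.
It would need to be looped for a long time while watching the guest's dmesg for
the "bad frame in rt_sigreturn" line.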