Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

From: Johannes Berg
Date: Fri Mar 01 2024 - 03:38:59 EST


On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
>
> In my opinion, the timeout should depend on the type of device driver.
>
> In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
>
> For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> ending a gaming match, watching a YouTube video, etc.).
> If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> instructions alone may take inexperienced Linux users more than five minutes.

That's all not wrong, but I don't see why you wouldn't automate this
even on end user machines? I feel you're boxing the problem in by
wanting to solve it entirely in the kernel?

> I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.

At an hour now, people will probably start arguing that "indefinitely"
is about right? But at that point you're probably back to persisting
them on disk anyway? Or maybe glitches happen during logout/shutdown ...

Anyway, I don't want to block this because I just don't care enough
about how you do things, but I think the kernel is the wrong place to
solve this problem... The intent here was to give some userspace time to
grab it (and yes for that 5 minutes is already way too long), not the
users. That's also part of the reason we only hold on to a single
instance, since I didn't want it to keep consuming more and more memory
for it if happens repeatedly.

johannes