Re: signal: break out of wait loops on kthread_stop()

From: Tvrtko Ursulin
Date: Wed Oct 19 2022 - 16:09:43 EST

Next message: Conor Dooley: "Re: [PATCH] PCI: Remove unnecessary of_irq.h includes"
Previous message: Julia Lawall: "Re: [PATCH v4 2/2] staging: most: dim2: correct misleading struct type name"
In reply to: Sultan Alsawaf: "Re: signal: break out of wait loops on kthread_stop()"
Next in thread: Jason A. Donenfeld: "Re: [Intel-gfx] signal: break out of wait loops on kthread_stop()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 19/10/2022 19:16, Jason A. Donenfeld wrote:

On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:

On 19/10/2022 17:00, Jason A. Donenfeld wrote:

On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
<tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:

Hi,

A question regarding a7c01fa93aeb ("signal: break out of wait loops on
kthread_stop()") if I may.

We have a bunch code in i915, possibly limited to self tests (ie debug
builds) but still important for our flows, which spawn kernel threads
and exercises parts of the driver.

Problem we are hitting with this patch is that code did not really need
to be signal aware until now. Well to say that more accurately - we were
able to test the code which is normally executed from userspace, so is
signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
cases itself.

For example threads which exercise an internal API for a while until the
parent calls kthread_stop. Now those tests can hit unexpected errors.

Question is how to best approach working around this change. It is of
course technically possible to rework our code in more than one way,
although with some cost and impact already felt due reduced pass rates
in our automated test suites.

Maybe an opt out kthread flag from this new behavior? Would that be
acceptable as a quick fix? Or any other comments?

You can opt out by running `clear_tsk_thread_flag(current,
TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
fix your code instead. Were I your reviewer, I wouldn't merge code
that took the lazy path like that. However, that should work, if you
do opt for the quick fix.

Also, are you confident that the change will not catch anyone else by
surprise? In the original thread I did not spot any concerns about the
kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS
from random call chains.

Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
abusing the API by calling kthread_run() followed by kthread_stop().

Hm why is kthread_stop() after kthread_run() abuse? I don't see it in kerneldoc that it must not be used for stopping threads.

As evidence of how broken your code actually is, the kthread_stop()
function has a comment that makes it clear, "This can also be called
after kthread_create() instead of calling wake_up_process(): the thread
will exit without calling threadfn()," yet i915 has attempted to hack
around it with ridiculous yields and msleeps. For example:

threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
&t, "igt/%d", n);
...

yield(); /* start all threads before we begin */
msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
...
err = kthread_stop(threads[n]);

Or here's another one:

tsk = kthread_run(fn, &thread[i], "igt-%d", i);
...
msleep(10); /* start all threads before we kthread_stop() */
...
status = kthread_stop(tsk);

I mean come on.

This is brittle and bad and kind of ridiculous that it shipped this way.
Now you're asking to extend your brittleness, so that you can avoid the
work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
only 5, as far as I can tell.

Yep the yields and sleeps are horrible and will go. But they are also not relevant for the topic at hand. Issue is signal_pending() in the thread which just happens to now let kthread_stop() exit the thread before the work it used to do. And lack of consistent EINTR/ERESTARTSYS handling throughout.

Luckily I am almost sure this hasn't "shipped" anywhere real, in the sense it is debug only part of the driver.

Never mind, I was not looking for anything more than a suggestion on how to maybe work around it in piece as someone is dealing with the affected call sites.

kthread_wait below is perhaps a bit too indirect, since overall refactoring of the approach will be needed, but thanks anyway.

Thanks,

Tvrtko

Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has
propagated to our development tree on Monday, our automated testing
started failing significantly, which prevents us merging new work until
resolved. So a quick fix trumps the ideal road in the short term. Just
because it is quick.

"Short term" -- somehow I can imagine the short term hack will turn into
a long term one.

Anyway, what I suspect you might actually want as a bandaid is a
"kthread_wait()"-like function, that doesn't try to stop the thread with
KTHREAD_SHOULD_STOP and such, but just waits for the completion:

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 30e5bec81d2b..2699cc45ad15 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
int kthread_stop(struct task_struct *k);
+int kthread_wait(struct task_struct *k);
bool kthread_should_stop(void);
bool kthread_should_park(void);
bool __kthread_should_park(struct task_struct *k);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..d581d78a3a26 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
}
EXPORT_SYMBOL(kthread_stop);

+int kthread_wait(struct task_struct *k)
+{
+ struct kthread *kthread;
+ int ret;
+
+ get_task_struct(k);
+ kthread = to_kthread(k);
+ wake_up_process(k);
+ wait_for_completion(&kthread->exited);
+ ret = kthread->result;
+ put_task_struct(k);
+
+ return ret;
+}
+EXPORT_SYMBOL(kthread_stop);
+
int kthreadd(void *unused)
{
struct task_struct *tsk = current;

Next message: Conor Dooley: "Re: [PATCH] PCI: Remove unnecessary of_irq.h includes"
Previous message: Julia Lawall: "Re: [PATCH v4 2/2] staging: most: dim2: correct misleading struct type name"
In reply to: Sultan Alsawaf: "Re: signal: break out of wait loops on kthread_stop()"
Next in thread: Jason A. Donenfeld: "Re: [Intel-gfx] signal: break out of wait loops on kthread_stop()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]