Re: [PATCH 2/2] x86, mce: rework use of TIF_MCE_NOTIFY

From: Tony Luck
Date: Tue Jun 14 2011 - 14:02:40 EST


On Mon, Jun 13, 2011 at 7:53 PM, Hidetoshi Seto
<seto.hidetoshi@xxxxxxxxxxxxxx> wrote:
> + * Called in process context that was interrupted by MCE and marked with
> + * TIF_MCE_NOTIFY, just before returning to erroneous userland.
> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.
>  */
>  void mce_notify_process(void)
>  {
> -       mce_notify_irq();
> -       mce_memory_failure_process();
> +       clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +       /* TBD: do recovery for action required event */
>  }

I liked where this series was going - but I'm not sure how we will
be able to write code to fill in the TBD here. You've got us to
a good state: the process that hit the action-required error
can't get to user space to re-execute because of TIF_MCE_NOTIFY.
So that part is great. But we don't have the information we
need (the failing address) to take some action. That was put into
the ring. It might still be there, but it could already have been
grabbed and handled by the worker thread (I can't tell which). So
the error might have been handled, or might be in the process of
being handled (we could be racing with the worker) - but we don't know.
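
To make the problem concrete, here is a rough user-space model of the
situation. It is only a sketch: pfn_ring_sketch, ring_put_sketch() and
ring_get_sketch() are made-up names for illustration, not the real
ring code in mce.c. The point is just that a ring entry has a single
consumer, so once the worker has taken the pfn, the task on its way
back to user space finds nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE_SKETCH 4

/* Hypothetical, simplified ring of bad pfns (not the kernel's structure). */
struct pfn_ring_sketch {
	uint64_t pfn[RING_SIZE_SKETCH];
	size_t head;	/* next slot to write */
	size_t tail;	/* next slot to read */
};

/* Producer side: what the MC handler would do. Returns false if full. */
static bool ring_put_sketch(struct pfn_ring_sketch *r, uint64_t pfn)
{
	size_t next;

	assert(r != NULL);
	next = (r->head + 1) % RING_SIZE_SKETCH;
	if (next == r->tail)
		return false;
	r->pfn[r->head] = pfn;
	r->head = next;
	return true;
}

/* Consumer side: whoever calls this first gets the entry, nobody else does. */
static bool ring_get_sketch(struct pfn_ring_sketch *r, uint64_t *pfn)
{
	assert(r != NULL && pfn != NULL);
	if (r->tail == r->head)
		return false;
	*pfn = r->pfn[r->tail];
	r->tail = (r->tail + 1) % RING_SIZE_SKETCH;
	return true;
}
```

If the worker calls ring_get_sketch() before the affected task does,
the task's own call returns false. From that alone it cannot tell
"already handled" apart from "being handled right now".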

I think that for action-required we need to pass the PFN from
the MC handler to this mce_notify_process() function. Andi
put it into the task structure - and although I didn't like that
much (and Ingo hated it even more) - it was quite a simple way
to pass the information. The bad "pfn" *is* task-relevant data.
It's the reason that the task can't run, and the only hope of getting
the process back onto its feet again.
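
Something along these lines is what I have in mind, written as a
user-space sketch rather than kernel code. Everything here is an
assumption for illustration: task_sketch stands in for the couple of
task fields involved, mce_bad_pfn is a hypothetical field name, and
the recover callback stands in for whatever the VM handler ends up
being. It is not Andi's patch, just the shape of the handoff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the task fields that matter here. */
struct task_sketch {
	bool mce_notify;	/* models TIF_MCE_NOTIFY */
	bool mce_pfn_valid;	/* is mce_bad_pfn meaningful? */
	uint64_t mce_bad_pfn;	/* hypothetical per-task field */
	bool killed;		/* models kill/signal of the task */
};

/* Stand-in for the recovery action; returns 0 on success. */
typedef int (*recover_fn_sketch)(uint64_t pfn);

/* MC handler side: record the failing pfn in the task, then flag it. */
static void mc_handler_mark_task(struct task_sketch *t, uint64_t pfn)
{
	assert(t != NULL);
	t->mce_bad_pfn = pfn;
	t->mce_pfn_valid = true;
	t->mce_notify = true;
}

/* Return-to-user side: the pfn is right there, no ring lookup needed. */
static void mce_notify_process_sketch(struct task_sketch *t,
				      recover_fn_sketch recover)
{
	assert(t != NULL && recover != NULL);
	t->mce_notify = false;
	if (!t->mce_pfn_valid)
		return;
	t->mce_pfn_valid = false;
	if (recover(t->mce_bad_pfn) != 0)
		t->killed = true;	/* recovery failed: task cannot continue */
}

/* Two sample callbacks, only for exercising the sketch. */
static uint64_t last_recovered_pfn;

static int recover_ok(uint64_t pfn)
{
	last_recovered_pfn = pfn;
	return 0;
}

static int recover_fail(uint64_t pfn)
{
	(void)pfn;
	return -1;
}
```

The pfn travels with the task, so the return path does not care what
the worker thread did or did not pull out of the ring.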

My detour into task-return-notifiers was a massively more
complex way to achieve this same goal (the pfn there was
dropped into the container structure for the "urn" pointer that
was passed to the handler.)

Maybe I'm missing something obvious - but I think that
to fix the action-required error we need to know more
about the error than just which task is affected.

-Tony