Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

From: Shuai Xue
Date: Thu Nov 30 2023 - 22:38:01 EST

On 2023/12/1 01:39, James Morse wrote:
> Hi Boris, Shuai,
>
> On 29/11/2023 18:54, Borislav Petkov wrote:
>> On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
>>>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>>>>> - an AR error consumed by current process is deferred to handle in a
>>>>> dedicated kernel thread, but memory_failure() assumes that it runs in the
>>>>> current context
>>>>
>>>> On x86? ARM?
>>>>
>>>> Please point to the exact code flow.
>
>
>>> An AR error consumed by the current process is deferred to a dedicated
>>> kernel thread on the ARM platform. The AR error is handled in the flow
>>> below:
>
> Please don't think of errors as "action required" - that's a user-space signal code. If
> the page could be fixed by memory-failure(), you may never get a signal. (all this was the
> fix for always sending an action-required signal)
>
> I assume you mean the CPU accessed a poisoned location and took a synchronous error.

Yes, I mean that the CPU accessed a poisoned location and took a
synchronous error.
>
>
>>> -----------------------------------------------------------------------------
>>> [user space task einj_mem_uc consumed data poison, CPU 3] STEP 0
>>>
>>> -----------------------------------------------------------------------------
>>> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
>>> ghes_sdei_critical_callback
>>> => __ghes_sdei_callback
>>> => ghes_in_nmi_queue_one_entry // peek and read estatus
>>> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
>>> [ghes_sdei_critical_callback: return]
>>> -----------------------------------------------------------------------------
>>> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
>>> => ghes_do_proc
>>> => ghes_handle_memory_failure
>>> => ghes_do_memory_failure
>>> => memory_failure_queue // put work task on current CPU
>>> => if (kfifo_put(&mf_cpu->fifo, entry))
>>> schedule_work_on(smp_processor_id(), &mf_cpu->work);
>>> => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
>>> [ghes_proc_in_irq: return]
>>> -----------------------------------------------------------------------------
>>> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
>>> [memory_failure_work_func: current kworker, CPU 3]
>>> => memory_failure_work_func(&mf_cpu->work)
>>> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
>>> => memory_failure(entry.pfn, entry.flags);
>>
>> From the comment above that function:
>>
>> * The function is primarily of use for corruptions that
>> * happen outside the current execution context (e.g. when
>> * detected by a background scrubber)
>> *
>> * Must run in process context (e.g. a work queue) with interrupts
>> * enabled and no spinlocks held.
>>
>>> -----------------------------------------------------------------------------
>>> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
>>> => memory_failure_queue_kick
>>> => cancel_work_sync - waiting memory_failure_work_func finish
>>> => memory_failure_work_func(&mf_cpu->work)
>>> => kfifo_get(&mf_cpu->fifo, &entry); // no work
>>> -----------------------------------------------------------------------------
>>> [einj_mem_uc resumes at the same PC, triggering a page fault] STEP 5
>>>
>>> STEP 0: A user space task named einj_mem_uc consumes poison. The firmware
>>> notifies the kernel of the hardware error through SDEI
>>> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>>>
>>> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() raises
>>> an irq_work to handle hardware errors in IRQ context.
>>>
>>> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
>>> the current CPU's workqueue and adds task work to sync with the workqueue.
>>>
>>> STEP3: The kworker preempts the currently running thread and gets CPU 3.
>>> Then memory_failure() is processed in the kworker.
>>
>> See above.
>>
>>> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
>>> workqueue has been done before returning to user-space.
>>>
>>> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
>>> current instruction, because the poison page is unmapped by
>>> memory_failure() in step 3, so a page fault will be triggered.
>>>
>>> memory_failure() assumes that it runs in the current context on both x86
>>> and ARM platform.
>>>
>>>
>>> for example:
>>> memory_failure() in mm/memory-failure.c:
>>>
>>> if (flags & MF_ACTION_REQUIRED) {
>>> folio = page_folio(p);
>>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>>> }
>>
>> And?
>>
>> Do you see the check above it?
>>
>> if (TestSetPageHWPoison(p)) {
>>
>> test_and_set_bit() returns true only when the page was poisoned already.
>>
>> * This function is intended to handle "Action Required" MCEs on already
>> * hardware poisoned pages. They could happen, for example, when
>> * memory_failure() failed to unmap the error page at the first call, or
>> * when multiple local machine checks happened on different CPUs.
>>
>> And that's kill_accessing_process().
>>
>> So AFAIU, the kworker running memory_failure() would only mark the page
>> as poison.
>>
>> The killing happens when memory_failure() runs again and the process
>> touches the page again.
>>
>> But I'd let James confirm here.
>
> Yes, this is what is expected to happen with the existing code.
>
> The first pass will remove the pages from all processes that have it mapped before this
> user-space task can restart. Restarting the task will make it access a poisoned page,
> kicking off the second path which delivers the signal.
>
> The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you
> queued action-required before leaving GHES, memory-failure() would stomp on it.
>
>
>> I still don't know what you're fixing here.
>
> The problem is if the user-space process registered for early messages, it gets a signal
> on the first pass. If it returns from that signal, it will access the poisoned page and
> get the action-required signal.
>
> How is this making Qemu go wrong?

The problem here is that we have to assume that the first-pass
memory_failure() handles the error and unmaps the poisoned page successfully.

- If so, the second-pass action-required signal may work, because the task
accesses an unmapped page. But IMHO, we can improve on this by sending the
signal in a single pass, so that the guest only takes one vmexit, right?

- If not, there is no second-pass signal. The existing code does not handle
the error code returned by memory_failure(), so an exception loop occurs,
resulting in a hard lockup panic.

Besides, in a production environment, a second access to an already known
poisoned page introduces additional risk of error propagation.
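
To make the two cases concrete, here is a toy user-space model in C. It is
only a sketch and not kernel code: struct page_model, model_memory_failure(),
model_run() and MAX_RETRIES are hypothetical names invented for illustration,
and the "unmap failed" outcome is simply passed in as a flag. It counts how
many times the task consumes the poison before it is stopped.

```c
#include <stdbool.h>

/* Toy model only. All names here are hypothetical, not kernel symbols. */
enum { MAX_RETRIES = 5 };	/* cap so the modelled exception loop terminates */

struct page_model {
	bool poisoned;
	bool mapped;
};

/* Stand-in for the first pass: mark the page poisoned, try to unmap it. */
static int model_memory_failure(struct page_model *pg, bool unmap_ok)
{
	pg->poisoned = true;
	if (!unmap_ok)
		return -1;	/* assumed failure path: page stays mapped */
	pg->mapped = false;
	return 0;
}

/*
 * Returns how many synchronous errors the task takes before it is stopped.
 * Reaching MAX_RETRIES models the exception loop.
 */
static int model_run(bool unmap_ok, bool check_ret)
{
	struct page_model pg = { .poisoned = false, .mapped = true };
	int sync_errors = 0;

	while (sync_errors < MAX_RETRIES) {
		/* Task resumes at the same PC and touches the page again. */
		if (!pg.mapped)
			return sync_errors;	/* page fault path signals the task */

		sync_errors++;			/* poison consumed once more */
		int ret = model_memory_failure(&pg, unmap_ok);
		if (ret && check_ret)
			return sync_errors;	/* caller acts on the error code */
	}
	return sync_errors;
}
```

In this model, model_run(true, false) stops after one consumption (the two
pass case), model_run(false, false) keeps re-consuming the poison until the
cap (the exception loop), and model_run(false, true) stops after one
consumption because the caller acts on the error code.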

>
>
> As to how this works for you given Boris' comments above: kill_procs() is also called from
> hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where
> the action-optional signals come from.
>
>

Thank you very much for getting involved in the review and for your comments.

Best Regards,
Shuai