Re: [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock

From: Bjorn Andersson
Date: Sat Jan 27 2024 - 22:10:52 EST


On Fri, Jan 12, 2024 at 03:03:08PM -0800, Douglas Anderson wrote:
> As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
> enabled then we'll use it to stop CPUs at panic time. This is nice,
> but it does mean that there's a pretty good chance that we'll end up
> stopping a CPU while it holds the port lock for the console
> UART. Specifically, I see a CPU get stopped while holding the port
> lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
> enabling the "buddy" hardlockup detector and then doing:
>
> sysctl -w kernel.hardlockup_all_cpu_backtrace=1
> sysctl -w kernel.hardlockup_panic=1
> echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
>
> UART drivers are _supposed_ to handle this case OK and this is why
> UART drivers check "oops_in_progress" and only do a "trylock" in that
> case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
> very well-tested situation.
>
> Now that we're testing the situation a lot, it can be seen that the
> Qualcomm GENI UART driver is pretty broken. Specifically, when I run
> my test case and look at the console output I just see a bunch of
> garbled output like:
>
> [ 201.069084] NMI backtrace[ 201.069084] NM[ 201.069087] CPU: 6
> PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
> #1 01112b9f14923cbd0b[ 201.069090] Hardware name: Google Lazor
> ([ 201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
> 201.069095] pc : smp_call_function_man[ 201.069099]
>
> That's obviously not so great. This happens because each call to the
> console driver exits after the data has been written to the FIFO but
> before it's actually been flushed out of the serial port. When we have
> multiple calls into the console one after the other then (if we can't
> get the lock) each call tells the UART to throw away any data in the
> FIFO that hadn't been transferred yet.
>
> I've posted up a patch to change the arm64 core to avoid this
> situation most of the time [1] much like x86 seems to do, but even if
> that patch lands the GENI driver should still be fixed.
>
> From testing, it appears that we can just delete the cancel/abort in
> the case where we weren't able to get the UART lock and the output
> looks good. It makes sense that we'd be able to do this since that
> means we'll just call into __qcom_geni_serial_console_write() and
> __qcom_geni_serial_console_write() looks much like
> qcom_geni_serial_poll_put_char() but with a loop. However, it seems
> safest to poll the FIFO and make sure it's empty before our
> transfer. This should reliably make sure that we're not
> interrupting/clobbering any existing transfers.
>
> As part of this change, we'll also avoid re-setting up a TX at the end
> of the console write function if we weren't able to get the lock,
> since accessing "port->tx_remaining" without the lock is not
> safe. This is only needed to re-start userspace initiated transfers.
>
> [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid
>
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>

Reviewed-by: Bjorn Andersson <andersson@xxxxxxxxxx>

Regards,
Bjorn

> ---
>
> drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
> 1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 7e78f97e8f43..06ebe62f99bc 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>
> geni_status = readl(uport->membase + SE_GENI_STATUS);
>
> - /* Cancel the current write to log the fault */
> if (!locked) {
> - geni_se_cancel_m_cmd(&port->se);
> - if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> - M_CMD_CANCEL_EN, true)) {
> - geni_se_abort_m_cmd(&port->se);
> - qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> - M_CMD_ABORT_EN, true);
> - writel(M_CMD_ABORT_EN, uport->membase +
> - SE_GENI_M_IRQ_CLEAR);
> - }
> - writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> + /*
> + * We can only get here if an oops is in progress then we were
> + * unable to get the lock. This means we can't safely access
> + * our state variables like tx_remaining. About the best we
> + * can do is wait for the FIFO to be empty before we start our
> + * transfer, so we'll do that.
> + */
> + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> + M_TX_FIFO_NOT_EMPTY_EN, false);
> } else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
> /*
> * It seems we can't interrupt existing transfers if all data
> @@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>
> __qcom_geni_serial_console_write(uport, s, count);
>
> - if (port->tx_remaining)
> - qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>
> - if (locked)
> + if (locked) {
> + if (port->tx_remaining)
> + qcom_geni_serial_setup_tx(uport, port->tx_remaining);
> uart_port_unlock_irqrestore(uport, flags);
> + }
> }
>
> static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
> --
> 2.43.0.275.g3460e3d667-goog
>