Re: [PATCH v2 1/3] ipmi: fix msg stack when IPMI is disconnected

From: Corey Minyard
Date: Sun Oct 09 2022 - 07:03:55 EST


On Sun, Oct 09, 2022 at 05:18:09PM +0800, Zhang Yuchen wrote:
> If you continue to access and send messages at a high frequency (once
> every 55s) when the IPMI is disconnected, messages will accumulate in
> intf->[hp_]xmit_msg. If it lasts long enough, it takes up a lot of
> memory.

This is queued for 6.2. Thanks. I already had the other two patches
queued.

-corey

>
> The reason is that if IPMI is disconnected, each message will be set to
> IDLE after it returns to HOSED through IDLE->ERROR0->HOSED. The next
> message goes through the same process when it comes in. This process
> needs to wait for IBF_TIMEOUT * (MAX_ERROR_RETRIES + 1) = 55s.
>
> Each message takes 55S to destroy. This results in a continuous increase
> in memory.
>
> I find that if I wait 5 seconds after the first message fails, the
> status changes to ERROR0 in smi_timeout(). The next message will return
> the error code IPMI_NOT_IN_MY_STATE_ERR directly without wait.
>
> This is more in line with our needs.
>
> So instead of setting each message state to IDLE after it reaches the
> state HOSED, set state to ERROR0.
>
> After testing, the problem has been solved, no matter how many
> consecutive sends, will not cause continuous memory growth. It also
> returns to normal immediately after the IPMI is restored.
>
> In addition, the HOSED state should also count as invalid. So the HOSED
> is removed from the invalid judgment in start_kcs_transaction().
>
> The verification operations are as follows:
>
> 1. Use BPF to record the ipmi_alloc/free_smi_msg().
>
> $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc
> %p\n",retval);} kprobe:free_recv_msg {printf("free %p\n",arg0)}'
>
> 2. Exec `date; time for x in $(seq 1 2); do ipmitool mc info; done`.
> 3. Record the output of `time` and when free all msgs.
>
> Before:
>
> `time` takes 120s, This is because `ipmitool mc info` send 4 msgs and
> waits only 15 seconds for each message. Last msg is free after 440s.
>
> $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc
> %p\n",retval);} kprobe:free_recv_msg {printf("free %p\n",arg0)}'
> Oct 05 11:40:55 Attaching 2 probes...
> Oct 05 11:41:12 alloc 0xffff9558a05f0c00
> Oct 05 11:41:27 alloc 0xffff9558a05f1a00
> Oct 05 11:41:42 alloc 0xffff9558a05f0000
> Oct 05 11:41:57 alloc 0xffff9558a05f1400
> Oct 05 11:42:07 free 0xffff9558a05f0c00
> Oct 05 11:42:07 alloc 0xffff9558a05f7000
> Oct 05 11:42:22 alloc 0xffff9558a05f2a00
> Oct 05 11:42:37 alloc 0xffff9558a05f5a00
> Oct 05 11:42:52 alloc 0xffff9558a05f3a00
> Oct 05 11:43:02 free 0xffff9558a05f1a00
> Oct 05 11:43:57 free 0xffff9558a05f0000
> Oct 05 11:44:52 free 0xffff9558a05f1400
> Oct 05 11:45:47 free 0xffff9558a05f7000
> Oct 05 11:46:42 free 0xffff9558a05f2a00
> Oct 05 11:47:37 free 0xffff9558a05f5a00
> Oct 05 11:48:32 free 0xffff9558a05f3a00
>
> $ root@dc00-pb003-t106-n078:~# date;time for x in $(seq 1 2); do
> ipmitool mc info; done
>
> Wed Oct 5 11:41:12 CST 2022
> No data available
> Get Device ID command failed
> No data available
> No data available
> No valid response received
> Get Device ID command failed: Unspecified error
> No data available
> Get Device ID command failed
> No data available
> No data available
> No valid response received
> No data available
> Get Device ID command failed
>
> real 1m55.052s
> user 0m0.001s
> sys 0m0.001s
>
> After:
>
> `time` takes 55s, all msgs is returned and free after 55s.
>
> $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc
> %p\n",retval);} kprobe:free_recv_msg {printf("free %p\n",arg0)}'
>
> Oct 07 16:30:35 Attaching 2 probes...
> Oct 07 16:30:45 alloc 0xffff955943aa9800
> Oct 07 16:31:00 alloc 0xffff955943aacc00
> Oct 07 16:31:15 alloc 0xffff955943aa8c00
> Oct 07 16:31:30 alloc 0xffff955943aaf600
> Oct 07 16:31:40 free 0xffff955943aa9800
> Oct 07 16:31:40 free 0xffff955943aacc00
> Oct 07 16:31:40 free 0xffff955943aa8c00
> Oct 07 16:31:40 free 0xffff955943aaf600
> Oct 07 16:31:40 alloc 0xffff9558ec8f7e00
> Oct 07 16:31:40 free 0xffff9558ec8f7e00
> Oct 07 16:31:40 alloc 0xffff9558ec8f7800
> Oct 07 16:31:40 free 0xffff9558ec8f7800
> Oct 07 16:31:40 alloc 0xffff9558ec8f7e00
> Oct 07 16:31:40 free 0xffff9558ec8f7e00
> Oct 07 16:31:40 alloc 0xffff9558ec8f7800
> Oct 07 16:31:40 free 0xffff9558ec8f7800
>
> root@dc00-pb003-t106-n078:~# date;time for x in $(seq 1 2); do
> ipmitool mc info; done
> Fri Oct 7 16:30:45 CST 2022
> No data available
> Get Device ID command failed
> No data available
> No data available
> No valid response received
> Get Device ID command failed: Unspecified error
> Get Device ID command failed: 0xd5 Command not supported in present state
> Get Device ID command failed: Command not supported in present state
>
> real 0m55.038s
> user 0m0.001s
> sys 0m0.001s
>
> Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@xxxxxxxxxxxxx>
> ---
> drivers/char/ipmi/ipmi_kcs_sm.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/char/ipmi/ipmi_kcs_sm.c b/drivers/char/ipmi/ipmi_kcs_sm.c
> index efda90dcf5b3..ecfcb50302f6 100644
> --- a/drivers/char/ipmi/ipmi_kcs_sm.c
> +++ b/drivers/char/ipmi/ipmi_kcs_sm.c
> @@ -122,10 +122,10 @@ struct si_sm_data {
> unsigned long error0_timeout;
> };
>
> -static unsigned int init_kcs_data(struct si_sm_data *kcs,
> - struct si_sm_io *io)
> +static unsigned int init_kcs_data_with_state(struct si_sm_data *kcs,
> + struct si_sm_io *io, enum kcs_states state)
> {
> - kcs->state = KCS_IDLE;
> + kcs->state = state;
> kcs->io = io;
> kcs->write_pos = 0;
> kcs->write_count = 0;
> @@ -140,6 +140,12 @@ static unsigned int init_kcs_data(struct si_sm_data *kcs,
> return 2;
> }
>
> +static unsigned int init_kcs_data(struct si_sm_data *kcs,
> + struct si_sm_io *io)
> +{
> + return init_kcs_data_with_state(kcs, io, KCS_IDLE);
> +}
> +
> static inline unsigned char read_status(struct si_sm_data *kcs)
> {
> return kcs->io->inputb(kcs->io, 1);
> @@ -270,7 +276,7 @@ static int start_kcs_transaction(struct si_sm_data *kcs, unsigned char *data,
> if (size > MAX_KCS_WRITE_SIZE)
> return IPMI_REQ_LEN_EXCEEDED_ERR;
>
> - if ((kcs->state != KCS_IDLE) && (kcs->state != KCS_HOSED)) {
> + if (kcs->state != KCS_IDLE) {
> dev_warn(kcs->io->dev, "KCS in invalid state %d\n", kcs->state);
> return IPMI_NOT_IN_MY_STATE_ERR;
> }
> @@ -495,7 +501,7 @@ static enum si_sm_result kcs_event(struct si_sm_data *kcs, long time)
> }
>
> if (kcs->state == KCS_HOSED) {
> - init_kcs_data(kcs, kcs->io);
> + init_kcs_data_with_state(kcs, kcs->io, KCS_ERROR0);
> return SI_SM_HOSED;
> }
>
> --
> 2.30.2
>