Re: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline

From: Reinette Chatre
Date: Thu Mar 28 2024 - 17:11:52 EST


Hi Tony,

On 3/27/2024 4:01 PM, Luck, Tony wrote:
>> There seems to be two issues here (although I am not familiar with these flows). First,
>> it seems that tick_nohz_full_mask is not actually allocated unless the user boots
>> with a "nohz_full=". This means that any attempt to access bits within tick_nohz_full_mask
>> will cause this OOPS. If that is allocated then the second issue seems that the
>> buried __ffs() call requires that it not be called with 0 and this checking is not done.
>>
>> To me it seems most appropriate to fix this at the central place to ensure all scenarios
>> are handled instead of scattering checks.
>
> Good analysis.
>
>> To that end, what do you think of something like below? It uses tick_nohz_full_enabled() check
>> to ensure that tick_nohz_full_mask is actually allocated while the other changes aim to
>> avoid __ffs() on 0.

I studied the flows some more and I no longer believe that there is a risk of __ffs() being
called on 0. Looking at the flows starting with cpumask_nth_andnot() the value provided to
__ffs() is ensured to be non-zero via either an explicit check or hweight_long().

To confirm this I tested a CONFIG_NO_HZ_FULL kernel by booting with "nohz_full=" set to the
highest numbered CPU in the domain. When all the CPUs are offlined the behavior for the
"second to last" and "last" CPU are most interesting and worked as expected when using
cpumask_nth_andnot().

>
> Looks good.
>
> Tested-by: Tony Luck <tony.luck@xxxxxxxxx>
> Reviewed-by: Tont Luck <tony.luck@xxxxxxxxx>

Considering the motivation above I will submit a change that only adds the
tick_nohz_full_enabled() check. Since it is such a big change from the snippet you reviewed
and tested here I dropped your tags from it and hope you can reconsider the fix after considering
the information above.

Reinette