Re: [PATCH v3 00/24] gpio: rework locking and object life-time control

From: Bartosz Golaszewski
Date: Tue Feb 13 2024 - 07:08:58 EST


On Tue, Feb 13, 2024 at 1:05 PM Marek Szyprowski
<m.szyprowski@xxxxxxxxxxx> wrote:
>
> On 08.02.2024 10:58, Bartosz Golaszewski wrote:
> > From: Bartosz Golaszewski <bartosz.golaszewski@xxxxxxxxxx>
> >
> > This is a big rework of locking in GPIOLIB. The current serialization is
> > pretty much useless. There is one big spinlock (gpio_lock) that "protects"
> > both the GPIO device list, GPIO descriptor access and who knows what else.
> >
> > I'm putting "protects" in quotes as in several places the lock is
> > taken, released whenever a sleeping function is called and re-taken
> > without regards for the "protected" state that may have changed.
> >
> > First a little background on what we're dealing with in GPIOLIB. We have
> > consumer API functions that can be called from any context explicitly
> > (get/set value, set direction) as well as many others which will get
> > called in atomic context implicitly (e.g. set config called in certain
> > situations from gpiod_direction_output()).
> >
> > On the other side: we have GPIO provider drivers whose callbacks may or
> > may not sleep depending on the underlying protocol.
> >
> > This makes any attempts at serialization quite complex. We typically
> > cannot use sleeping locks - we may be called from atomic - but we also
> > often cannot use spinlocks - provider callbacks may sleep. Moreover: we
> > have close ties with the interrupt and pinctrl subsystems, often either
> > calling into them or getting called from them. They use their own locking
> > schemes which are at odds with ours (pinctrl uses mutexes, the interrupt
> > subsystem can call GPIO helpers with spinlock taken).
> >
> > There is also another significant issue: the GPIO device object contains
> > a pointer to gpio_chip which is the implementation of the GPIO provider.
> > This object can be removed at any point - as GPIOLIB officially supports
> > hotplugging with all the dynamic expanders that we provide drivers for -
> > and leave the GPIO API callbacks with a suddenly NULL pointer. This is
> > a problem that allowed user-space processes to easily crash the kernel
> > until we patched it with a read-write semaphore in the user-space facing
> > code (but the problem still exists for in-kernel users). This was
> > recognized before as evidenced by the implementation of validate_desc()
> > but without proper serialization, simple checking for a NULL pointer is
> > pointless and we do need a generic solution for that issue as well.
> >
> > If we want to get it right - the more lockless we go, the better. This is
> > why SRCU seems to be the right candidate for the mechanism to use. In fact
> > it's the only mechanism we can use our read-only critical sections to be
> > called from atomic and protecc contexts as well as call driver callbacks
> > that may sleep (for the latter case).
> >
> > We're going to use it in three places: to protect the global list of GPIO
> > devices, to ensure consistency when dereferencing the chip pointer in GPIO
> > device struct and finally to ensure that users can access GPIO descriptors
> > and always see a consistent state.
> >
> > We do NOT serialize all API callbacks. This means that provider callbacks
> > may be called simultaneously and GPIO drivers need to provide their own
> > locking if needed. This is on purpose. First: we only support exclusive
> > GPIO usage* so there's no risk of two drivers getting in each other's way
> > over the same GPIO. Second: with this series, we ensure enough consistency
> > to limit the chance of drivers or user-space users crashing the kernel.
> > With additional improvements in handling the flags field in GPIO
> > descriptors there's very little to gain, while bitbanging drivers may care
> > about the increased performance of going lockless.
> >
> > This series brings in one somewhat significant functional change for
> > in-kernel users, namely: GPIO API calls, for which the underlying GPIO
> > chip is gone, will no longer return 0 and emit a log message but instead
> > will return -ENODEV.
> >
> > I know this is a lot of code to go through but the more eyes we get on it
> > the better.
>
> I've noticed that this patchset landed in today's linux-next. It causes
> a lots of warning during boot on my test boards when LOCKDEP is enabled
> in kernel configs. Do you want me to report all of them? Some can be
> easily reproduced even with QEMU's virt ARM and ARM64 machines.
>

Marek,

Thanks for the report. I've already sent out patches that fix these
problems. Sorry for this.

Bartosz

[snip]