Re: [PATCH v3 00/24] gpio: rework locking and object life-time control

From: Bartosz Golaszewski
Date: Mon Feb 12 2024 - 05:07:53 EST

On Thu, Feb 8, 2024 at 10:59 AM Bartosz Golaszewski <brgl@xxxxxxxx> wrote:
> From: Bartosz Golaszewski <bartosz.golaszewski@xxxxxxxxxx>
> This is a big rework of locking in GPIOLIB. The current serialization is
> pretty much useless. There is one big spinlock (gpio_lock) that "protects"
> both the GPIO device list, GPIO descriptor access and who knows what else.
> I'm putting "protects" in quotes as in several places the lock is
> taken, released whenever a sleeping function is called and re-taken
> without regards for the "protected" state that may have changed.
> First a little background on what we're dealing with in GPIOLIB. We have
> consumer API functions that can be called from any context explicitly
> (get/set value, set direction) as well as many others which will get
> called in atomic context implicitly (e.g. set config called in certain
> situations from gpiod_direction_output()).
> On the other side: we have GPIO provider drivers whose callbacks may or
> may not sleep depending on the underlying protocol.
> This makes any attempts at serialization quite complex. We typically
> cannot use sleeping locks - we may be called from atomic - but we also
> often cannot use spinlocks - provider callbacks may sleep. Moreover: we
> have close ties with the interrupt and pinctrl subsystems, often either
> calling into them or getting called from them. They use their own locking
> schemes which are at odds with ours (pinctrl uses mutexes, the interrupt
> subsystem can call GPIO helpers with spinlock taken).
> There is also another significant issue: the GPIO device object contains
> a pointer to gpio_chip which is the implementation of the GPIO provider.
> This object can be removed at any point - as GPIOLIB officially supports
> hotplugging with all the dynamic expanders that we provide drivers for -
> and leave the GPIO API callbacks with a suddenly NULL pointer. This is
> a problem that allowed user-space processes to easily crash the kernel
> until we patched it with a read-write semaphore in the user-space facing
> code (but the problem still exists for in-kernel users). This was
> recognized before as evidenced by the implementation of validate_desc()
> but without proper serialization, simple checking for a NULL pointer is
> pointless and we do need a generic solution for that issue as well.
> If we want to get it right - the more lockless we go, the better. This is
> why SRCU seems to be the right candidate for the mechanism to use. In fact
> it's the only mechanism we can use our read-only critical sections to be
> called from atomic and protecc contexts as well as call driver callbacks
> that may sleep (for the latter case).
> We're going to use it in three places: to protect the global list of GPIO
> devices, to ensure consistency when dereferencing the chip pointer in GPIO
> device struct and finally to ensure that users can access GPIO descriptors
> and always see a consistent state.
> We do NOT serialize all API callbacks. This means that provider callbacks
> may be called simultaneously and GPIO drivers need to provide their own
> locking if needed. This is on purpose. First: we only support exclusive
> GPIO usage* so there's no risk of two drivers getting in each other's way
> over the same GPIO. Second: with this series, we ensure enough consistency
> to limit the chance of drivers or user-space users crashing the kernel.
> With additional improvements in handling the flags field in GPIO
> descriptors there's very little to gain, while bitbanging drivers may care
> about the increased performance of going lockless.
> This series brings in one somewhat significant functional change for
> in-kernel users, namely: GPIO API calls, for which the underlying GPIO
> chip is gone, will no longer return 0 and emit a log message but instead
> will return -ENODEV.
> I know this is a lot of code to go through but the more eyes we get on it
> the better.
> Thanks,
> Bartosz
> * - This is not technically true. We do provide the
> GPIOD_FLAGS_BIT_NONEXCLUSIVE flag. However this is just another piece of
> technical debt. This is a hack provided for a single use-case in the
> regulator framework which got out of control and is now used in many
> places that should have never touched it. It's utterly broken and doesn't
> even provide any contract as to what a "shared GPIO" is. I would argue
> that it's the next thing we should address by providing "reference counted
> GPIO enable", not just a flag allowing to request the same GPIO twice
> and then allow two drivers to fight over who toggles it as is the case
> now. For now, let's just treat users of GPIOD_FLAGS_BIT_NONEXCLUSIVE like
> they're consciously and deliberately choosing to risk undefined behavior.
> v2 -> v3:
> - fix SRCU cleanup in error path
> - add a comment explaining the use of gpio_device_find() in sysfs code
> - don't needlessly dereference gdev->chip in sysfs code
> - don't return 1 (INPUT) for NULL descriptors from gpiod_get_direction(),
> rather: return -EINVAL
> - fix debugfs code: take the SRCU read lock in .start() and release it in
> .stop() callbacks for seq file instead of taking it locally which
> doesn't protect the entire seq printout
> - move the removal of the GPIO device from the list before setting the
> chip pointer to NULL
> v1 -> v2:
> - fix jumping over variable initialization in sysfs code
> - fix RCU-related sparse warnings
> - fix a smatch complaint about uninitialized variables (even though it's
> a false positive coming from the fact that scoped_guard() is implemented
> as a for loop
> - fix a potential NULL-pointer dereference in debugfs callbacks
> - improve commit messages
> Bartosz Golaszewski (24):
> gpio: protect the list of GPIO devices with SRCU
> gpio: of: assign and read the hog pointer atomically
> gpio: remove unused logging helpers
> gpio: provide and use gpiod_get_label()
> gpio: don't set label from irq helpers
> gpio: add SRCU infrastructure to struct gpio_desc
> gpio: protect the descriptor label with SRCU
> gpio: sysfs: use gpio_device_find() to iterate over existing devices
> gpio: remove gpio_lock
> gpio: reinforce desc->flags handling
> gpio: remove unneeded code from gpio_device_get_desc()
> gpio: sysfs: extend the critical section for unregistering sysfs
> devices
> gpio: sysfs: pass the GPIO device - not chip - to sysfs callbacks
> gpio: cdev: replace gpiochip_get_desc() with gpio_device_get_desc()
> gpio: cdev: don't access gdev->chip if it's not needed
> gpio: sysfs: don't access gdev->chip if it's not needed
> gpio: don't dereference gdev->chip in gpiochip_setup_dev()
> gpio: reduce the functionality of validate_desc()
> gpio: remove unnecessary checks from gpiod_to_chip()
> gpio: add the can_sleep flag to struct gpio_device
> gpio: add SRCU infrastructure to struct gpio_device
> gpio: protect the pointer to gpio_chip in gpio_device with SRCU
> gpio: remove the RW semaphore from the GPIO device
> gpio: mark unsafe gpio_chip manipulators as deprecated
> drivers/gpio/gpiolib-cdev.c | 92 ++--
> drivers/gpio/gpiolib-of.c | 4 +-
> drivers/gpio/gpiolib-sysfs.c | 151 ++++---
> drivers/gpio/gpiolib.c | 791 +++++++++++++++++++----------------
> drivers/gpio/gpiolib.h | 86 ++--
> 5 files changed, 635 insertions(+), 489 deletions(-)
> --
> 2.40.1

Applied the series. I'll let it spend some time in next and let's fix
any remaining issues in tree.