Re: [patch update 2 fix] PM: Introduce core framework for run-time PM of I/O devices

From: Rafael J. Wysocki
Date: Wed Jun 17 2009 - 19:07:45 EST

Next message: David Miller: "Re: [patch] ipv4: don't warn about skb ack allocation failures"
Previous message: Herton Ronaldo Krzesinski: "[PATCH v2] Staging: comedi (s626): always use subvendor:subdevice ids in pci id table"
In reply to: Alan Stern: "Re: [patch update 2 fix] PM: Introduce core framework for run-timePM of I/O devices"
Next in thread: Alan Stern: "Re: [patch update 2 fix] PM: Introduce core framework for run-timePM of I/O devices"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Alan,

Thanks a lot for the review!

On Wednesday 17 June 2009, Alan Stern wrote:
> On Wed, 17 Jun 2009, Rafael J. Wysocki wrote:
>
> > Sorry for the broken patch. My mailer started to wordwrap messages
> > automatically and I didn't notice.
> >
> > The correct patch is appended.
>
> > Index: linux-2.6/include/linux/pm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/pm.h
> > +++ linux-2.6/include/linux/pm.h
>
> > + * @runtime_suspend: Prepare the device for a condition in which it won't be
> > + * able to communicate with the CPU(s) and RAM due to power management.
> > + * This need not mean that the device should be put into a low power state,
> > + * like for example when the device is behind a link, represented by a
>
> Suggested rephrasing: For example, if the device is behind a link
> which is about to be turned off, the device may remain at full power.
> But if the device does go to low power and if device_may_wakeup(dev)
> is true, enable remote wakeup.

Done.

> > +/**
> > + * Device run-time power management state.
> > + *
> > + * These state labels are used internally by the PM core to indicate the current
> > + * status of a device with respect to the PM core operations. They do not
> > + * reflect the actual power state of the device or its status as seen by the
> > + * driver.
> > + *
> > + * RPM_ACTIVE Device is fully operational, no run-time PM requests are
> > + * pending for it.
> > + *
> > + * RPM_IDLE It has been requested that the device be suspended.
> > + * Suspend request has been put into the run-time PM
> > + * workqueue and it's pending execution.
> > + *
> > + * RPM_SUSPENDING Device bus type's ->runtime_suspend() callback is being
> > + * executed.
> > + *
> > + * RPM_SUSPENDED Device bus type's ->runtime_suspend() callback has
> > + * completed successfully. The device is regarded as
> > + * suspended.
> > + *
> > + * RPM_WAKE It has been requested that the device be woken up.
> > + * Resume request has been put into the run-time PM
> > + * workqueue and it's pending execution.
> > + *
> > + * RPM_RESUMING Device bus type's ->runtime_resume() callback is being
> > + * executed.
> > + *
> > + * RPM_ERROR Represents a condition from which the PM core cannot
> > + * recover by itself. If the device's run-time PM status
> > + * field has this value, all of the run-time PM operations
> > + * carried out for the device by the core will fail, until
> > + * the status field is changed to either RPM_ACTIVE or
> > + * RPM_SUSPENDED (it is not valid to use the other values
> > + * in such a situation) by the device's driver or bus type.
> > + * This happens when the device bus type's
> > + * ->runtime_suspend() or ->runtime_resume() callback
> > + * returns error code different from -EAGAIN or -EBUSY.
>
> What about RPM_GRACE?

Forgotten.

Well, I've already replaced it with a counter (more about it below).

> > + */
> > +
> > +#define RPM_ACTIVE 0
> > +#define RPM_IDLE 0x01
> > +#define RPM_SUSPENDING 0x02
> > +#define RPM_SUSPENDED 0x04
> > +#define RPM_WAKE 0x08
> > +#define RPM_RESUMING 0x10
> > +#define RPM_GRACE 0x20
> > +#define RPM_ERROR (-1)
>
> This won't work very well when assigned to an unsigned 6-bit field.

OK, I'm changing it to 0x1F (IOW, all bits set).

> > +
> > +#define RPM_IN_SUSPEND (RPM_SUSPENDING | RPM_SUSPENDED)
> > +#define RPM_INACTIVE (RPM_IDLE | RPM_IN_SUSPEND)
> > +#define RPM_NO_SUSPEND (RPM_WAKE | RPM_RESUMING | RPM_GRACE)
> > +#define RPM_IN_PROGRESS (RPM_SUSPENDING | RPM_RESUMING)
>
> Since each of these is used only once, it would be better not to
> define them as macros. Use the parenthesized expression instead; this
> will be easier for readers to understand.

OK

> > +/**
> > + * __pm_runtime_change_status - Change the run-time PM status of a device.
> > + * @dev: Device to handle.
> > + * @status: Expected current run-time PM status of the device.
> > + * @new_status: New value of the device's run-time PM status.
> > + *
> > + * Change the run-time PM status of the device to @new_status if its current
> > + * value is equal to @status.
> > + */
> > +void __pm_runtime_change_status(struct device *dev, unsigned int status,
>
> If RPM_ERROR is -1 then status better not be unsigned.

That's fixed by redefining RPM_ERROR (see above).

> > + unsigned int new_status)
> > +{
> > + unsigned long flags;
> > +
> > + if (atomic_read(&dev->power.depth) > 0)
> > + return;
>
> Return only if new_status == RPM_SUSPENDED.

Not only then. The dev->power.depth counter was meant to be a "disable
everything" one, because there are situations in which we don't want even
resume to run (probe, release, system-wide suspend, hibernation, resume from
a system sleep state, possibly others).

That said, I overlooked some problems related to it. So, I think to disable
the runtime PM of given device, it will be necessary to run a synchronous
runtime resume with taking a ref to block suspend.

> Is this routine ever called with status equal to anything other than
> RPM_ERROR?

Not at the moment. OK, I'll change it.

> +/**
> + * pm_check_children - Check if all children of a device have been suspended.
> + * @dev: Device to check.
> + *
> + * Returns 0 if all children of the device have been suspended or -EBUSY
> + * otherwise.
> + */
> +static int pm_check_children(struct device *dev)
> +{
> + return dev->power.suspend_skip_children ? 0 :
> + device_for_each_child(dev, NULL, pm_device_suspended);
> +}
>
> Instead of a costly device_for_each_child(), would it be better to
> maintain a counter with the number of unsuspended children?

Hmm. How exactly are we going to count them? The only way I see at the moment
would be to increase this number by one when running pm_runtime_init() for a
new child. Seems doable.

> > +/**
> > + * __pm_runtime_suspend - Run a device bus type's runtime_suspend() callback.
> > + * @dev: Device to suspend.
> > + * @sync: If unset, the funtion has been called via pm_wq.
> > + *
> > + * Check if the status of the device is appropriate and run the
> > + * ->runtime_suspend() callback provided by the device's bus type driver.
> > + * Update the run-time PM flags in the device object to reflect the current
> > + * status of the device.
> > + */
> > +int __pm_runtime_suspend(struct device *dev, bool sync)
> > +{
> > + int error = -EINVAL;
> > +
> > + if (atomic_read(&dev->power.depth) > 0)
> > + return -EBUSY;
>
> Should this test be made inside the scope of the spinlock?

Yes, it should.

> For that matter, should power.depth always be set within the spinlock?
> If it is then it doesn't need to be an atomic_t.

pm_runtime_[dis|en]able() don't take the lock when changing it, but it's
going to be dropped anyway.

> > +
> > + spin_lock(&dev->power.lock);
>
> Should be spin_lock_irq(). Same in other places.

OK, I wasn't sure about that.

> > +
> > + if (dev->power.runtime_status == RPM_ERROR) {
> > + goto out;
> > + } else if (dev->power.runtime_status & RPM_SUSPENDED) {
> > + error = 0;
> > + goto out;
> > + } else if ((dev->power.runtime_status & RPM_NO_SUSPEND)
> > + || (!sync && dev->power.suspend_aborted)) {
> > + /*
> > + * Device is resuming or in a post-resume grace period or
> > + * there's a resume request pending, or a pending suspend
> > + * request has just been cancelled and we're running as a result
> > + * of this request.
> > + */
>
> In the sync case, it might be better to wait until the ongoing resume
> (or resume grace period) is finished and then do the suspend.
>
> Of course, this depends on the context in which the synchronous
> runtime suspend is carried out. Right now, the only such context I
> know of is when the user tells the system to force a USB device into a
> suspended state.

>From the functionality point of view, nothing wrong happens if runtime suspend
fails as long as an error code is returned and the caller has to be prepared
for a failure anyway. Moreover, we never know why the resume is carried out,
so it's not clear whether it will be valid to carry out the suspend after that.

>
> > + error = -EAGAIN;
> > + goto out;
> > + } else if (dev->power.runtime_status == RPM_SUSPENDING) {
> > + spin_unlock(&dev->power.lock);
> > +
> > + /*
> > + * Another suspend is running in parallel with us. Wait for it
> > + * to complete and return.
> > + */
> > + wait_for_completion(&dev->power.work_done);
> > +
> > + return dev->power.runtime_error;
> > + } else if (pm_check_children(dev)) {
> > + /*
> > + * We can only suspend the device if all of its children have
> > + * been suspended.
> > + */
> > + dev->power.runtime_status = RPM_ACTIVE;
> > + error = -EAGAIN;
>
> -EBUSY would be more appropriate.

OK

> > + goto out;
> > + }
>
> > +/**
> > + * pm_cancel_suspend - Cancel a pending suspend request for given device.
> > + * @dev: Device to cancel the suspend request for.
> > + */
> > +static void pm_cancel_suspend(struct device *dev)
> > +{
> > + cancel_delayed_work(&dev->power.runtime_work);
> > + dev->power.runtime_status &= RPM_GRACE;
>
> This looks strange. Aren't we guaranteed at this point that the
> status is RPM_IDLE?

Yes.

> > + dev->power.suspend_aborted = true;
> > +}
> > +
> > +/**
> > + * __pm_runtime_resume - Run a device bus type's runtime_resume() callback.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + *
> > + * Check if the device is really suspended and run the ->runtime_resume()
> > + * callback provided by the device's bus type driver. Update the run-time PM
> > + * flags in the device object to reflect the current status of the device. If
> > + * runtime suspend is in progress while this function is being run, wait for it
> > + * to finish before resuming the device. If runtime suspend is scheduled, but
> > + * it hasn't started yet, cancel it and we're done.
> > + */
> > +int __pm_runtime_resume(struct device *dev, bool grace)
> > +{
> > + int error = -EINVAL;
> ...
> > + } else if (dev->power.runtime_status == RPM_SUSPENDED && dev->parent
> > + && (dev->parent->power.runtime_status & ~RPM_GRACE)) {
> > + spin_unlock(&dev->power.lock);
>
> Here's where you want to increment the parent's depth. Figuring out
> where to decrement it again isn't easy, given the way this routine is
> structured.

Hmm. We can use a local bool variable to store the information that the ref
has been taken for the parent and dereference it when leaving the function.

> > + spin_unlock(&dev->parent->power.lock);
> > +
> > + /* The device's parent is not active. Resume it and repeat. */
> > + error = __pm_runtime_resume(dev->parent, false);
> > + if (error)
> > + return error;
>
> Need to reset error to -EINVAL.

Why -EINVAL?

> > +/**
> > + * pm_request_resume - Schedule run-time resume of given device.
> > + * @dev: Device to resume.
> > + * @grace: If set, force a post-resume grace period.
> > + */
> > +void __pm_request_resume(struct device *dev, bool grace)
> > +{
> > + unsigned long parent_flags = 0, flags;
> > +
> > + repeat:
> > + if (atomic_read(&dev->power.depth) > 0)
> > + return;
> > +
> > + if (dev->parent)
> > + spin_lock_irqsave(&dev->parent->power.lock, parent_flags);
> > + spin_lock_irqsave(&dev->power.lock, flags);
> > +
> > + if (dev->power.runtime_status == RPM_IDLE) {
> > + /* Autosuspend request is pending, no need to resume. */
> > + pm_cancel_suspend(dev);
> > + if (grace)
> > + dev->power.runtime_status |= RPM_GRACE;
> > + goto out;
> > + } else if (!(dev->power.runtime_status & RPM_IN_SUSPEND)) {
> > + goto out;
> > + } else if (dev->parent
> > + && (dev->parent->power.runtime_status & RPM_INACTIVE)) {
> > + spin_unlock_irqrestore(&dev->power.lock, flags);
> > + spin_unlock_irqrestore(&dev->parent->power.lock, parent_flags);
> > +
> > + /* The parent is suspending, suspended or idle. Wake it up. */
> > + __pm_request_resume(dev->parent, false);
> > +
> > + goto repeat;
>
> What if the parent's state is RPM_SUSPENDING? Won't this go into a
> tight loop? You need to test the parent's WAKEUP bit above.

Right.

> > Index: linux-2.6/Documentation/power/runtime_pm.txt
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/Documentation/power/runtime_pm.txt
> > @@ -0,0 +1,311 @@
> > +Run-time Power Management Framework for I/O Devices
> > +
> > +(C) 2009 Rafael J. Wysocki <rjw@xxxxxxx>, Novell Inc.
> > +
> > +1. Introduction
> > +
> > +The support for run-time power management (run-time PM) of I/O devices is
>
> s/The support/Support/

OK

> > +provided at the power management core (PM core) level by means of:
>
> > +pm_runtime_enable() and pm_runtime_disable() are used to enable and disable,
> > +respectively, all of the run-time PM core operations. They do it by decreasing
> > +and increasing, respectively, the 'power.depth' field of 'struct device'. If
> > +the value of this field is greater than 0, pm_runtime_suspend(),
> > +pm_request_suspend(), pm_runtime_resume() and so on return immediately without
> > +doing anything and -EBUSY is returned by pm_runtime_suspend(),
> > +pm_runtime_resume() and pm_runtime_resume_grace(). Therefore, if
>
> In your code, pm_runtime_disable() doesn't actually do a resume. So if
> a driver wants to make sure a device is at full power and stays that
> way, it has to call:
>
> pm_runtime_resume(dev);
> pm_runtime_disable(dev);
>
> This is a race; another thread might suspend the device in between.
> It would make more sense to have have pm_runtime_resume() function
> normally even when depth > 0. Then the calls could be made in the
> opposite order and there wouldn't be a race.
>
> The equivalent code in USB does this automatically. The
> runtime-disable routine does a resume if the depth value was
> originally 0,

Yes, we should do that in general.

> and the runtime-enable routine queues a delayed autosuspend request if the
> final depth value is 0.

I don't like this.

> > +pm_runtime_suspend(), pm_request_suspend(), pm_runtime_resume(),
> > +pm_runtime_resume_grace(), pm_request_resume(), and pm_request_resume_grace()
> > +use the 'power.runtime_status' and 'power.suspend_aborted' fields of
> > +'struct device' for mutual synchronization. The 'power.runtime_status' field,
>
> Strictly speaking, they use those fields for mutual cooperation. It's
> the power.lock field which provides synchronization.

OK

> > +pm_runtime_suspend() is used to carry out a run-time suspend of an active
> > +device. It is called directly by a bus type or device driver. An asynchronous
> > +version of it is called by the PM core, to complete a request queued up by
> > +pm_request_suspend(). The only difference between them is the handling of
> > +situations when a queued up suspend request has just been cancelled. Apart from
> > +this, they work in the same way.
> > +* If the device is suspended (i.e. the RPM_SUSPENDED bit is set in the device's
> > + run-time PM status field, 'power.runtime_status'), success is returned.
>
> Blank lines surrounding the *-ed paragraphs would make this more
> readable.

OK

> > +pm_request_resume() and pm_request_resume_grace() are used to queue up a resume
> > +request for a device that is suspended, suspending or has a suspend request
> > +pending. The difference between them is that pm_request_resume_grace() causes
> > +the RPM_GRACE bit to be set in the device's run-time PM status field, which
> > +prevents the PM core from suspending the device or queuing up a suspend request
> > +for it until the RPM_GRACE bit is cleared with the help of pm_runtime_release().
> > +Apart from this, they work in the same way.
>
> Is RPM_GRACE really needed? Can't we accomplish more or less the same
> thing by using the autosuspend delay combined with the depth counter?

No, it's not. As I said above, I replaced it with a counter and then I
realized that 'disable' should in fact do 'resume and get', so we can handle
everything with just one counter.

I'll send a revised patch tomorrow.

Best,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: David Miller: "Re: [patch] ipv4: don't warn about skb ack allocation failures"
Previous message: Herton Ronaldo Krzesinski: "[PATCH v2] Staging: comedi (s626): always use subvendor:subdevice ids in pci id table"
In reply to: Alan Stern: "Re: [patch update 2 fix] PM: Introduce core framework for run-timePM of I/O devices"
Next in thread: Alan Stern: "Re: [patch update 2 fix] PM: Introduce core framework for run-timePM of I/O devices"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]