Re: [RFC PATCH 0/3] clk: sunxi-ng: Optimize rate selection for NKM clocks

From: Maxime Ripard
Date: Wed Jun 07 2023 - 07:42:31 EST

Next message: Johan Hovold: "Re: [PATCH v8 6/9] usb: dwc3: qcom: Add multiport controller support for qcom wrapper"
Previous message: Sapkal Swapnil: "Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression"
In reply to: Frank Oltmanns: "Re: [RFC PATCH 0/3] clk: sunxi-ng: Optimize rate selection for NKM clocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Jun 06, 2023 at 10:40:34PM +0200, Frank Oltmanns wrote:
> Hi Maxime,
>
> On 2023-06-06 at 16:02:58 +0200, Maxime Ripard <mripard@xxxxxxxxxx> wrote:
> > [[PGP Signed Part:Undecided]]
> > On Mon, Jun 05, 2023 at 12:31:41PM +0200, Frank Oltmanns wrote:
> >> Hi Maxime,
> >>
> >> On 2023-06-02 at 09:31:59 +0200, Maxime Ripard <mripard@xxxxxxxxxx> wrote:
> >> > [[PGP Signed Part:Undecided]]
> >> > Hi,
> >> >
> >> > On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote:
> >> >> On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@xxxxxxxxxx> wrote:
> >> >> > [[PGP Signed Part:Undecided]]
> >> >> > Hi Frank,
> >> >> >
> >> >> > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote:
> >> >> >> I would like to bring your attention to the current process of setting
> >> >> >> the rate of an NKM clock. As it stands, when setting the rate of an
> >> >> >> NKM clock, the rate nearest but less than or equal to the requested
> >> >> >> rate is found, instead of the nearest rate.
> >> >> >
> >> >> > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags()
> >> >> > for example. Some devices require that we don't overshoot, while some
> >> >> > prefer to have the closest rate.
> >> >> >
> >> >> > Both are fine, and it's a bit context specific which one we should
> >> >> > favour. If we were to do anything, it would be to support both and let
> >> >> > the clock driver select which behaviour it wants.
> >> >> >
> >> >>
> >> >> Ok, understood. Thank you for the explanation! Now, I'm wondering if
> >> >> anyone would be using such a flag, if I added it.
> >> >
> >> > I guess that's another thing :) If no-one is going to use it, why should
> >> > we do it in the first place?
> >> >
> >> > But most likely the display and audio clocks are usually fairly ok with
> >> > overshooting a bit, and a closest rate is usually better.
> >>
> >> Ok, I dived a bit deeper into this, but, as far as I can tell, the
> >> closest rate is not used anywhere in the sunxi-ng ccu driver. So, when
> >> extending, e.g., the NM or NKM clock to support, one must also extend at
> >> least the mux clocks, because they expect rates less than the requested
> >> rate. That seems to be quite the undertaking for only a small gain in
> >> precision.
> >
> > mux clocks are using __clk_mux_determine_rate which should have the
> > behaviour you want when CLK_MUX_ROUND_CLOSEST is set.
>
> https://elixir.bootlin.com/linux/v6.3.6/source/drivers/clk/sunxi-ng/ccu-sun50i-a64.c#L539
> is one example of a mux clock combined with a divider that is a bit more
> complex. I didn't look too deeply, but it seemed to me, that it would
> require two separate flags: One for the mux component and one for the
> div component. Maybe I'm mistaken, but it seems to me that the concept
> of having selected rates always be equal to or less than requested
> rates, seems to be deeply ingrained in the sunxi-ng driver. I'm afraid
> that I might miss some parts, therefore I abandoned that idea for now
> (especially since I have only one board for testing).

All the implementations (afaik) of the "composite" clocks we have that
combine a mux and dividers eventually use ccu_mux_helper_determine_rate
and divider_round_rate_parent. The last one can use
CLK_DIVIDER_ROUND_CLOSEST, and you covered the first one in your patch.
Which one did you leave out?

> >> >> >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1])
> >> >> >> when setting a rate, each time iterating over all combinations of n,
> >> >> >> k, and m.
> >> >> >
> >> >> > Yeah, that's expected as well.
> >> >>
> >> >> I'm wondering though, if iterating over all combinations is set in
> >> >> stone, or if some kind of optimization would be in order.
> >> >
> >> > The thing with optimization is that you need to optimize for something.
> >> > So you need to show that this code is suboptimal (by whatever metric you
> >> > want to optimize for), and that your code is more optimal that it used
> >> > to be.
> >> >
> >> > It means identifying a problem, writing benchmarks, and showing that the
> >> > new code performs better there.
> >> >
> >> > For example, your code might very well be faster, but it will increase
> >> > the kernel image (and thus the RAM usage). One is not more optimal than
> >> > the other in absolute, they both are, using a different metric.
> >>
> >> Sure, I get that. I'll submit a patchset that adds the functionality to
> >> NKM clocks to set the rate of their parents.
> >>
> >> With the new patchset, the time for, e.g. setting DCLK increases from
> >> ~0.5 ms to a whopping 30 - 37 ms. Those times were taken
> >> unscientifically on my pinephone, i.e. kernel logging and a couple of
> >> re-boots. But I think that still might give an idea of why I thought
> >> about the need to increase performance.
> >>
> >> The reason for this massive increase is, that the patch iterates over
> >> all combinations of NKM for pll-mipi, and for each combination it
> >> iterates over all combinations of NM for pll-video0.
> >>
> >> Nevertheless, following your and Jernej's advice, I'll submit the
> >> patchset first and then we can discuss if speed optimizations are needed
> >> and what cost is acceptable.
> >
> > Honestly, for 40ms, it will be a hard sell :)
>
> I'm not sure what part you think is the "hard-sell":
> a. the patch itself because 30 to 40 ms is way too much
> b. the optimization, because 30 to 40 ms isn't all that much.
> I honestly don't know.
>
> BTW, this is the patchset in case you missed it:
> https://lore.kernel.org/lkml/20230605190745.366882-1-frank@xxxxxxxxxxxx/
>
> Note, that I have a patch in the works, which is similar to the one in
> this thread, but for ccu_nm. Doing a binary search for finding the
> parent rate of pll-mipi, i.e., pll-video0, reduces the time from ~30 ms
> to less than 2 ms. If combined with only iterating through meaningful
> nkm combinations for pll-mipi, this should bring the time under 1 ms
> again.

My point is that:

* it's regression prone, so quite risky

* but if the endgoal is to gain 40ms once, at boot, without any use
case to back that up, there's basically no reward.

a high risk, low reward patch is what's hard to sell.

Maxime

Attachment: signature.asc
Description: PGP signature

Next message: Johan Hovold: "Re: [PATCH v8 6/9] usb: dwc3: qcom: Add multiport controller support for qcom wrapper"
Previous message: Sapkal Swapnil: "Re: [RFC PATCH V3 0/1] sched/numa: Fix disjoint set vma scan regression"
In reply to: Frank Oltmanns: "Re: [RFC PATCH 0/3] clk: sunxi-ng: Optimize rate selection for NKM clocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]