Re: [PATCH 7/7] PCI: Relax bridge window tail sizing rules

From: Andy Shevchenko
Date: Fri Dec 22 2023 - 08:49:19 EST


On Fri, Dec 22, 2023 at 02:29:01PM +0200, Ilpo Järvinen wrote:
> During remove & rescan cycle, PCI subsystem will recalculate and adjust
> the bridge window sizing that was initially done by "BIOS". The size
> calculation is based on the required alignment of the largest resource
> among the downstream resources as per pbus_size_mem() (unimportant or
> zero parameters marked with "..."):
>
> min_align = calculate_mem_align(aligns, max_order);
> size0 = calculate_memsize(size, ..., min_align);
>
> and then in calculate_memsize():
> size = ALIGN(max(size, ...) + ..., align);
>
> If the original bridge window sizing tried to conserve space, this will
> lead to massive increase of the required bridge window size when the
> downstream has a large disparity in BAR sizes. E.g., with 16MiB and
> 16GiB BARs this results in 32GiB bridge window size even if 16MiB BAR
> does not require gigabytes of space to fit.
>
> When doing remove & rescan for a bus that contains such a PCI device, a
> larger bridge window is suddenly required on rescan but when there is a
> bridge window upstream that is already assigned based on the original
> size, it cannot be enlarged to the new requirement. This causes the
> allocation of the bridge window to fail (0x600000000 > 0x400ffffff):
>
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:01:00.0: PCI bridge to [bus 02-04]
> pci 0000:01:00.0: bridge window [mem 0x40400000-0x406fffff]
> pci 0000:01:00.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> ...
> pci_bus 0000:03: busn_res: [bus 03] is released
> pci 0000:03:00.0: reg 0x10: [mem 0x6400000000-0x6400ffffff 64bit pref]
> pci 0000:03:00.0: reg 0x18: [mem 0x6000000000-0x63ffffffff 64bit pref]
> pci 0000:03:00.0: reg 0x30: [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
> pci 0000:02:01.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
> pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> pci 0000:03:00.0: BAR 2: no space for [mem size 0x400000000 64bit pref]
> pci 0000:03:00.0: BAR 2: failed to assign [mem size 0x400000000 64bit pref]
> pci 0000:03:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
> pci 0000:03:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
> pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
>
> This is a major surprise for users who are suddenly left with a PCIe
> device that was working fine with the original bridge window sizing.
>
> Even if the already assigned bridge window could be enlarged by
> reallocation in some cases (something the current code does not attempt
> to do), it is not possible in general case and the large amount of
> wasted space at the tail of the bridge window may lead to other
> resource exhaustion problems on Root Complex level (think of multiple
> PCIe cards with VFs and BAR size disparity in a single system).
>
> PCI specifications only expect natural alignment for BARs (PCI Express
> Base Specification, rev. 6.1 sect. 7.5.1.2.1) and minimum of 1MiB
> alignment for the bridge window (PCI Express Base Specification,
> rev 6.1 sect. 7.5.1.3). The current bridge window tail alignment rule
> was introduced in the commit 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI
> allocation code (alpha, arm, parisc) [2/2]") that only states:
> "pbus_size_mem: core stuff; tested with randomly generated sets of
> resources". It does not explain the motivation for the extra tail space
> allocated that is not truly needed by the downstream resources. As
> such, it is far from clear if it ever has been required by any HW.
>
> To prevent PCIe cards with BAR size disparity from becoming unusable
> after remove & rescan cycle, attempt to do a truly minimal allocation
> for memory resources if needed. First check if the normally calculated
> bridge window will not fit into an already assigned upstream resource.
> In such case, try with relaxed bridge window tail sizing rules instead
> where no extra tail space is requested beyond what the downstream
> resources require. Only enforce the alignment requirement of the bridge
> window itself (normally 1MiB).
>
> With this patch, the resources are successfully allocated:
>
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref] to [bus 03] requires relaxed alignment rules
> pci 0000:02:01.0: BAR 9: assigned [mem 0x6000000000-0x6400ffffff 64bit pref]
> pci 0000:02:01.0: BAR 8: assigned [mem 0x40400000-0x405fffff]
> pci 0000:03:00.0: BAR 2: assigned [mem 0x6000000000-0x63ffffffff 64bit pref]
> pci 0000:03:00.0: BAR 0: assigned [mem 0x6400000000-0x6400ffffff 64bit pref]
> pci 0000:03:00.0: BAR 6: assigned [mem 0x40400000-0x405fffff pref]
> pci 0000:02:01.0: PCI bridge to [bus 03]
> pci 0000:02:01.0: bridge window [mem 0x40400000-0x405fffff]
> pci 0000:02:01.0: bridge window [mem 0x6000000000-0x6400ffffff 64bit pref]
>
> This patch draws inspiration from the initial investigations and work
> by Mika Westerberg.

...

> + struct resource_constraint constraint = { .max = (resource_size_t)~0ULL,

RESOURCE_SIZE_MAX from limits.h.

> + .align = align };

Also I prefer the style

struct resource_constraint constraint = {
.max = RESOURCE_SIZE_MAX,
.align = align,
};


...

> + if (!r || r == &ioport_resource || r == &iomem_resource)
> + continue;

> + if (!r->parent || (r->flags & mask) != type)

Thinking more about these checks, r->parent should be NULL for the root
resources, hence this check basically covers the second part of the above.

But like you said it's a material for a separate investigation.

> + continue;

...

> + pci_dbg(bus->self,
> + "Assigned bridge window %pR to %pR cannot fit 0x%llx required for %s bridging to %pR\n",
> + r, &bus->busn_res,

> + (unsigned long long)size,

Yeah, casting here is a compromise between good looking message and
kernel code.

> + pci_name(downstream->self),
> + &downstream->busn_res);
> + }

...

> + pbus_upstream_assigned_limit(bus, mask | IORESOURCE_PREFETCH, type,
> + size0, add_align)) {

One line?

...

> + size0 = calculate_memsize(size, min_size, 0, 0,
> + resource_size(b_res), win_align);

One line?

--
With Best Regards,
Andy Shevchenko