Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit

From: Tim Chen
Date: Fri Aug 25 2017 - 18:19:59 EST


On 08/25/2017 12:58 PM, Linus Torvalds wrote:
> On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>> Now that we have added breaks in the wait queue scan and allow bookmark
>> on scan position, we put this logic in the wake_up_page_bit function.
>
> Oh, _this_ is the other patch you were talking about. I thought it was
> the NUMA counter threashold that was discussed around the same time,
> and that's why I referred to Mel.
>
> Gods, _this_ patch is ugly. No, I'm not happy with it at all. It
> makes that wait_queue_head much bigger, for this disgusting one use.
>
> So no, this is no good.
>
> Now, maybe the page_wait_table[] itself could be changed to be
> something that is *not* just the wait-queue head.
>
> But if we need to change the page_wait_table[] itself to have more
> information, then we should make it not be a wait-queue at all, we
> should make it be a list of much more specialized entries, indexed by
> the {page,bit} tuple.
>
> And once we do that, we probably *could* use something like two
> specialized lists: one that is wake-all, and one that is wake-one.
>
> So you'd have something like
>
> struct page_wait_struct {
> struct list_node list;
> struct page *page;
> int bit;
> struct llist_head all;
> struct llist_head exclusive;
> };
>
> and then the "page_wait_table[]" would be just an array of
>
> struct page_wait_head {
> spinlock_t lock;
> struct hlist_head list;
> };
>
> and then the rule is:
>
> - each page/bit combination allocates one of these page_wait_struct
> entries when somebody starts waiting for it for the first time (and we
> use the spinlock in the page_wait_head to serialize that list).
>
> - an exclusive wait (ie somebody who waits to get the bit for
> locking) adds itself to the 'exclusive' llist
>
> - non-locking waiters add themselves to the 'all' list
>
> - we can use llist_del_all() to remove the 'all' list and then walk
> it and wake them up without any locks
>
> - we can use llist_del_first() to remove the first exclusive waiter
> and wait _it_ up without any locks.
>
> Hmm? How does that sound? That means that we wouldn't use the normal
> wait-queues at all for the page hash waiting. We'd use this two-level
> list: one list to find the page/bit thing, and then two lists within
> that fdor the wait-queues for just *that* page/bit.
>
> So no need for the 'key' stuff at all, because the page/bit data would
> be in the first data list, and the second list wouldn't have any of
> these traversal issues where you have to be careful and do it one
> entry at a time.

If I have the waker count and flags on the wait_page_queue structure
will that have been more acceptable?

Also I think patch 1 is still a good idea for a fail safe mechanism
in case there are other long wait list.

That said, I do think your suggested approach is cleaner. However, it
is a much more substantial change. Let me take a look and see if I
have any issues implementing it.

Tim



>
> Linus
>