Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus

From: Leonardo Brás
Date: Tue Nov 08 2022 - 18:10:28 EST


On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote:
> On Fri 04-11-22 22:45:58, Leonardo Brás wrote:
> > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > > [...]
> > > > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > > > a) The isolated CPU is requesting the stock drain,
> > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > > > the first time after a remote drain.
> > > > >
> > > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > > contends with the remote draining which is out of control of the RT
> > > > > task. It is true that the RT kernel will turn that spin lock into a
> > > > > sleeping RT lock and that could help with potential priority inversions
> > > > > but still quite costly thing I would expect.
> > > > >
> > > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > > > fine.
> > > > >
> > > > > Now I am not sure I understand. If you do not consider charging path to
> > > > > be RT sensitive then why is this needed in the first place? What else
> > > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > >
> > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > > > time with the RT workload, we can have preemption of the RT workload, which is a
> > > > problem for meeting the deadlines.
> > >
> > > Yes, this is understood. But it is not really clear to me why would any
> > > draining be necessary for such an isolated CPU if no workload other than
> > > the RT (which pressumably doesn't charge any memory?) is running on that
> > > CPU? Is that the RT task during the initialization phase that leaves
> > > that cache behind or something else?
> >
> > (I am new to this part of the code, so please correct me when I miss something.)
> >
> > IIUC, if a process belongs to a control group with memory control, the 'charge'
> > will happen when a memory page starts getting used by it.
>
> Yes, very broadly speaking.
>
> > So, if we assume a RT load in a isolated CPU will not charge any memory, we are
> > assuming it will never be part of a memory-controlled cgroup.
>
> If the memory cgroup controler is enabled then each user space process
> is a part of some memcg. If there is no specific memcg assigned then it
> will be a root cgroup and that is skipped during most charges except for
> kmem.

Oh, it makes sense.
Thanks for helping me understand that!

>
> > I mean, can we just assume this?
> >
> > If I got that right, would not that be considered a limitation? like
> > "If you don't want your workload to be interrupted by perCPU cache draining,
> > don't put it in a cgroup with memory control".
>
> We definitely do not want userspace make any assumptions on internal
> implementation details like caches.

Perfect, that was my expectation.

>
> > > Sorry for being so focused on this
> > > but I would like to understand on whether this is avoidable by a
> > > different startup scheme or it really needs to be addressed in some way.
> >
> > No worries, I am in fact happy you are giving it this much attention :)
> >
> > I also understand this is a considerable change in the locking strategy, and
> > avoiding that is the first thing that should be tried.
> >
> > >
> > > > One way I thought to solve that was introducing a remote drain, which would
> > > > require a different strategy for locking, since not all accesses to the pcp
> > > > caches would happen on a local CPU.
> > >
> > > Yeah, I am not supper happy about additional spin lock TBH. One
> > > potential way to go would be to completely avoid pcp cache for isolated
> > > CPUs. That would have some performance impact of course but on the other
> > > hand it would give a more predictable behavior for those CPUs which
> > > sounds like a reasonable compromise to me. What do you think?
> >
> > You mean not having a perCPU stock, then?
> > So consume_stock() for isolated CPUs would always return false, causing
> > try_charge_memcg() always walking the slow path?
>
> Exactly.
>
> > IIUC, both my proposal and yours would degrade performance only when we use
> > isolated CPUs + memcg. Is that correct?
>
> Yes, with a notable difference that with your spin lock option there is
> still a chance that the remote draining could influence the isolated CPU
> workload throug that said spinlock. If there is no pcp cache for that
> cpu being used then there is no potential interaction at all.

I see.
But the slow path is slow for some reason, right?
Does not it make use of any locks also? So on normal operation there could be a
potentially larger impact than a spinlock, even though there would be no
scheduled draining.

>
> > If so, it looks like the impact would be even bigger without perCPU stock ,
> > compared to introducing a spinlock.
> >
> > Unless, we are counting to this case where a remote CPU is draining an isolated
> > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
> > released in the remote CPU. Well, this seems possible to happen, but I would
> > have to analyze how often would it happen, and how much would it impact the
> > deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
> > before its starts, so it can avoid the faulting latency, but I need to confirm
> > that.
>
> Yes, that is a general practice and the reason why I was asking how real
> of a problem that is in practice. 

I remember this was one common factor on deadlines being missed in the workload
analyzed. Need to redo the test to be sure.

> It is true true that appart from user
> space memory which can be under full control of the userspace there are
> kernel allocations which can be done on behalf of the process and those
> could be charged to memcg as well. So I can imagine the pcp cache could
> be populated even if the process is not faulting anything in during RT
> sensitive phase.

Humm, I think I will apply the change and do a comparative testing with
upstream. This should bring good comparison results.

>
> > On the other hand, compared to how it works now now, this should be a more
> > controllable way of introducing latency than a scheduled cache drain.
> >
> > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > predictability, but I am almost sure the cost in overall performance would not
> > be fine.
>
> It is hard to estimate the overhead without measuring that. Do you think
> you can give it a try? If the performance is not really acceptable
> (which I would be really surprised) then we can think of a more complex
> solution.

Sure, I can try that.
Do you suggest any specific workload that happens to stress the percpu cache
usage, with usual drains and so? Maybe I will also try with synthetic worloads
also.

>
> > With the possibility of prefaulting pages, do you see any scenario that would
> > introduce some undesirable latency in the workload?
>
> My primary concern would be spin lock contention which is hard to
> predict with something like remote draining.

It makes sense. I will do some testing and come out with results for that.

Thanks for reviewing!
Leo