Re: [RFC 0/3] Implementation of cgroup isolation

From: Zhu Yanhai
Date: Tue Mar 29 2011 - 09:16:24 EST


Michal,
Maybe what we need here is some kind of trade-off?
Let's say a new configuable parameter reserve_limit, for the cgroups
which want to
have some guarantee in the memory resource, we have:

limit_in_bytes > soft_limit > reserve_limit

MEM[limit_in_bytes..soft_limit] are the bytes that I'm willing to contribute
to the others if they are short of memory.

MEM[soft_limit..reserve_limit] are the bytes that I can afford if the others
are still eager for memory after I gave them MEM[limit_in_bytes..soft_limit].

MEM[reserve_limit..0] are the bytes which is a must for me to guarantee QoS.
Nobody is allowed to steal them.

And reserve_limit is 0 by default for the cgroups who don't care about Qos.

Then the reclaim path also needs some changes, i.e, balance_pgdat():
1) call mem_cgroup_soft_limit_reclaim(), if nr_reclaimed is meet, goto finish.
2) shrink the global LRU list, and skip the pages which belong to the cgroup
who have set a reserve_limit. if nr_reclaimed is meet, goto finish.
3) shrink the cgroups who have set a reserve_limit, and leave them with only
the reserve_limit bytes they need. if nr_reclaimed is meet, goto finish.
4) OOM

Does it make sense?

Thanks,
Zhu Yanhai


2011/3/29 Michal Hocko <mhocko@xxxxxxx>:
> On Tue 29-03-11 18:41:19, KAMEZAWA Hiroyuki wrote:
>> On Tue, 29 Mar 2011 10:59:43 +0200
>> Michal Hocko <mhocko@xxxxxxx> wrote:
>>
>> > On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:
> [...]
>> > > My opinions is to enhance softlimit is better.
>> >
>> > I will look how softlimit can be enhanced to match the expectations but
>> > I'm kind of suspicious it can handle workloads where heuristics simply
>> > cannot guess that the resident memory is important even though it wasn't
>> > touched for a long time.
>> >
>>
>> I think we recommend mlock() or hugepagefs to pin application's work area
>> in usual. And mm guyes have did hardwork to work mm better even without
>> memory cgroup under realisitic workloads.
>
> Agreed. Whenever this approach is possible we recomend the same thing.
>
>> If your worload is realistic but _important_ anonymous memory is swapped out,
>> it's problem of global VM rather than memcg.
>
> I would disagree with you on that. The important thing is that it can be
> defined from many perspectives. One is the kernel which considers long
> unused memory as not _that_ important. And it makes a perfect sense for
> most workloads.
> An important memory for an application can be something that would
> considerably increase the latency just because the memory got paged out
> (be it swap or the storage) because it contains pre-computed
> data that have a big initial costs.
> As you can see there is no mention about the time from the application
> POV because it can depend on the incoming requests which you cannot
> control.
>
>> If you add 'isolate' per process, okay, I'll agree to add isolate per memcg.
>
> What do you mean by isolate per process?
>
> [...]
>> > > > OK, I have tried to explain that in one of the (2nd) patch description.
>> > > > If I move all task from the root group to other group(s) and keep the
>> > > > primary application in the root group I would achieve some isolation as
>> > > > well. That is very much true.
>> > >
>> > > Okay, then, current works well.
>> > >
>> > > > But then there is only one such a group.
>> > >
>> > > I can't catch what you mean. you can create limitless cgroup, anywhere.
>> > > Can't you ?
>> >
>> > This is not about limits. This is about global vs. per-cgroup reclaim
>> > and how much they interact together.
>> >
>> > The everything-in-groups approach with the "primary" service in the root
>> > group (or call it unlimited) works just because all the memory activity
>> > (but the primary service) is caped with the limits so the rest of the
>> > memory can be used by the service. Moreover, in order this to work the
>> > limit for other groups would be smaller then the working set of the
>> > primary service.
>> >
>> > Even if you created a limitless group for other important service they
>> > would still interact together and if one goes wild the other would
>> > suffer from that.
>> >
>>
>> .........I can't understad what is the problem when global reclaim
>> runs just because an application wasn't limited ...or memory are
>> overcomitted.
>
> I am not sure I understand but what I see as a problem is when unrelated
> memory activity triggers reclaim and it pushes out the memory of a
> process group just because the heuristics done by the reclaim algorithm
> do not pick up the right memory - and honestly, no heuristic will fit
> all requirements. Isolation can protect from an unrelated activity
> without new heuristics.
>
> [...]
>> If softlimit (after some improvement) isn't enough, please add some other.
>>
>> What I think of is
>>
>> 1. need to "guarantee" memory usages in future.
>> Â Â"first come, first served" is not good for admins.
>
> this is not in scope of these patchsets but I agree that it would be
> nice to have this guarantee
>
>> 2. need to handle zone memory shortage. Using memory migration
>> Â Âbetween zones will be necessary to avoid pageout.
>
> I am not sure I understand.
>
>>
>> 3. need a knob to say "please reclaim from my own cgroup rather than
>> Â Âaffecting others (if usage > some(soft)limit)."
>
> Isn't this handled already and enhanced by the per-cgroup background
> reclaim patches?
>
>>
>> > [...]
>> > > > > I think you should put tasks in root cgroup to somewhere. It works perfect
>> > > > > against OOM. And if memory are hidden by isolation, OOM will happen easier.
>> > > >
>> > > > Why do you think that it would happen easier? Isn't it similar (from OOM
>> > > > POV) as if somebody mlocked that memory?
>> > > >
>> > >
>> > > if global lru scan cannot find victim memory, oom happens.
>> >
>> > Yes, but this will happen with mlocked memory as well, right?
>> >
>> Yes, of course.
>>
>> Anyway, I'll Nack to simple "first come, first served" isolation.
>> Please implement garantee, which is reliable and admin can use safely.
>
> Isolation is not about future guarantee. It is rather after you have it
> you can rely it will stay in unless in-group activity pushes it out.
>
>> mlock() has similar problem, So, I recommend hugetlbfs to customers,
>> admin can schedule it at boot time.
>> (the number of users of hugetlbfs is tend to be one app. (oracle))
>
> What if we decide that hugetlbfs won't be pinned into memory in future?
>
>>
>> I'll be absent, tomorrow.
>>
>> I think you'll come LSF/MM summit and from the schedule, you'll have
>> a joint session with Ying as "Memcg LRU management and isolation".
>
> I didn't have plans to do a session actively, but I can certainly join
> to talk and will be happy to discuss this topic.
>
>>
>> IIUC, "LRU management" is a google's performance improvement topic.
>>
>> It's ok for me to talk only about 'isolation' Â1st in earlier session.
>> If you want, please ask James to move session and overlay 1st memory
>> cgroup session. (I think you saw e-mail from James.)
>
> Yeah, I can do that.
>
> Thanks
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx ÂFor more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/