Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

From: Gleb Natapov
Date: Tue May 07 2013 - 10:57:24 EST


On Tue, May 07, 2013 at 11:33:29AM -0300, Marcelo Tosatti wrote:
> On Tue, May 07, 2013 at 01:00:51PM +0300, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> > > On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > > > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> > > >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > > >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> > > >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> > > >>>>
> > > >>>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > > >>>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> > > >>>>>>> releases mmu_lock and reacquires it again, only shadow pages
> > > >>>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
> > > >>>>>>> guarantees forward progress and eventual termination).
> > > >>>>>>>
> > > >>>>>>> kvm_mmu_zap_generation()
> > > >>>>>>>     spin_lock(mmu_lock)
> > > >>>>>>>     int generation = kvm->arch.mmu_generation;
> > > >>>>>>>
> > > >>>>>>>     for_each_shadow_page(sp) {
> > > >>>>>>>         if (sp->generation == generation)
> > > >>>>>>>             zap_page(sp)
> > > >>>>>>>         if (spin_needbreak(mmu_lock)) {
> > > >>>>>>>             kvm->arch.mmu_generation++;
> > > >>>>>>>             cond_resched_lock(mmu_lock);
> > > >>>>>>>         }
> > > >>>>>>>     }
> > > >>>>>>>
> > > >>>>>>> kvm_mmu_zap_all()
> > > >>>>>>>     spin_lock(mmu_lock)
> > > >>>>>>>     for_each_shadow_page(sp) {
> > > >>>>>>>         zap_page(sp)
> > > >>>>>>>         if (spin_needbreak(mmu_lock)) {
> > > >>>>>>>             cond_resched_lock(mmu_lock);
> > > >>>>>>>         }
> > > >>>>>>>     }
> > > >>>>>>>
> > > >>>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > > >>>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
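
For concreteness, a minimal sketch of how the generation scheme described
above could look (kvm->arch.mmu_generation and sp->generation are the
proposed new fields, not existing code, and the exact use of the zap helpers
is only illustrative):

/*
 * Sketch only: zap the shadow pages that existed when the call started,
 * dropping mmu_lock when needed.  Pages created after a lock break carry
 * a newer generation and are skipped, guaranteeing forward progress.
 */
static void kvm_mmu_zap_generation(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);
	int generation;

	spin_lock(&kvm->mmu_lock);
	/* Snapshot the generation this pass is responsible for. */
	generation = kvm->arch.mmu_generation;

restart:
	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
		if (spin_needbreak(&kvm->mmu_lock)) {
			/* Pages allocated from here on are not ours to zap. */
			kvm->arch.mmu_generation++;
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
			goto restart;
		}
		if (sp->generation == generation &&
		    kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
			goto restart;	/* child pages may have been zapped too */
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}

Each shadow page would get sp->generation = kvm->arch.mmu_generation when it
is created (e.g. when it is set up in kvm_mmu_get_page()).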
> > > >>>>>>>
> > > >>>>>>> This addresses the main problem: excessively long hold times
> > > >>>>>>> of kvm_mmu_zap_all with very large guests.
> > > >>>>>>>
> > > >>>>>>> Do you see any problem with this logic? This was what i was thinking
> > > >>>>>>> we agreed.
> > > >>>>>>
> > > >>>>>> No. I understand it and it can work.
> > > >>>>>>
> > > >>>>>> Actually, it is similar to Gleb's idea of "zapping stale shadow pages
> > > >>>>>> (using the lock break technique)"; after some discussion, we thought "only
> > > >>>>>> zap shadow pages that are reachable from the slot's rmap" was better, which
> > > >>>>>> is what this patchset does.
> > > >>>>>> (https://lkml.org/lkml/2013/4/23/73)
> > > >>>>>>
> > > >>>>> But this is not what the patch is doing. Close, but not the same :)
> > > >>>>
> > > >>>> Okay. :)
> > > >>>>
> > > >>>>> Instead of zapping shadow pages reachable from the slot's rmap, the patch
> > > >>>>> does kvm_unmap_rmapp(), which drops all sptes without zapping shadow pages.
> > > >>>>> That is why you need special code to re-init lpage_info. What I proposed
> > > >>>>> was to call zap_page() on all shadow pages reachable from the rmap. That
> > > >>>>> will take care of the lpage_info counters. Does this make sense?
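
As an illustration of that proposal, roughly (for_each_spte_in_rmap() below
is a stand-in for the real rmap walk helpers, and a real version would have
to restart the walk after each zap, since zapping removes sptes from the
rmap):

/*
 * Sketch: zap every shadow page that has a last-level spte in the slot's
 * rmap, instead of only dropping the sptes themselves.
 */
static void zap_slot_shadow_pages(struct kvm *kvm,
				  struct kvm_memory_slot *slot)
{
	LIST_HEAD(invalid_list);
	u64 *sptep;
	gfn_t gfn;

	spin_lock(&kvm->mmu_lock);
	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages; gfn++) {
		for_each_spte_in_rmap(kvm, slot, gfn, sptep) {
			/* The shadow page this spte lives in. */
			struct kvm_mmu_page *sp = page_header(__pa(sptep));

			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
		}
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}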
> > > >>>>
> > > >>>> Unfortunately, no! We still need to take care of lpage_info. lpage_info is
> > > >>>> used to count the number of guest page tables in the memslot.
> > > >>>>
> > > >>>> For example, there is a memslot:
> > > >>>> memslot[0].base_gfn = 0, memslot[0].npages = 100,
> > > >>>>
> > > >>>> and there is a shadow page:
> > > >>>> sp->role.direct = 0, sp->role.level = 4, sp->gfn = 10.
> > > >>>>
> > > >>>> This sp is counted in memslot[0], but it cannot be found by walking
> > > >>>> memslot[0]->rmap since there are no last-level mappings in this shadow page.
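
(For reference, the accounting in question is keyed only on sp->gfn, roughly
as in the sketch below; the names follow mmu.c but the details are
simplified, so take it as illustration rather than the exact code.)

/*
 * When a shadow page for a guest page table at 'gfn' is created, every
 * large-page level covering gfn gets its write_count raised, forcing 4K
 * mappings for that range.  The counter depends only on sp->gfn, so the
 * level-4 sp in the example above is counted even though it has no
 * last-level mappings reachable through the slot's rmap.
 */
static void account_shadowed_sketch(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
	struct kvm_lpage_info *linfo;
	int i;

	for (i = PT_DIRECTORY_LEVEL;
	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
		linfo = lpage_info_slot(gfn, slot, i);
		linfo->write_count += 1;
	}
}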
> > > >>>>
> > > >>> Right, so what about walking mmu_page_hash for each gfn belonging to the
> > > >>> slot that is in the process of being removed to find those?
> > > >>
> > > >> That would cost lots of time. The hashtable has 1 << 10 entries. If the
> > > >> memslot has 4M of memory, it will walk all the entries; the cost is the same
> > > >> as walking active_list (maybe a little more). And a memslot with 4M of
> > > >> memory is the normal case, I think.
> > > >>
> > > > Memslots will be much bigger with memory hotplug. A lock break should be
> > > > used while walking mmu_page_hash, obviously, but still, iterating over the
> > > > entire memslot gfn space to find the few gfns that may be there is
> > > > suboptimal. We can keep a list of them in the memslot itself.
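
A minimal sketch of what "keep a list of them in the memslot itself" could
look like (all names below are made up for illustration, not taken from the
patchset):

struct kvm_arch_memory_slot {
	/* ... existing fields ... */
	struct list_head shadow_pages;	/* sps accounted against this slot */
};

/*
 * struct kvm_mmu_page would grow a matching sp->slot_link, added to the
 * list wherever the sp is accounted against the slot.
 */
static void slot_zap_shadow_pages(struct kvm *kvm,
				  struct kvm_memory_slot *slot)
{
	struct kvm_mmu_page *sp, *tmp;
	LIST_HEAD(invalid_list);

	spin_lock(&kvm->mmu_lock);
	/* Zap exactly the pages that reference the slot being removed. */
	list_for_each_entry_safe(sp, tmp, &slot->arch.shadow_pages, slot_link) {
		list_del(&sp->slot_link);
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}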
> > >
> > > It sounds good to me.
> > >
> > > BTW, this approach looks more complex and uses more memory (a new list_head
> > > added to every shadow page); why do you dislike clearing lpage_info? ;)
> > >
> > It looks a little bit hackish, but now that I see we do not have an easy way
> > to find all shadow pages counted in lpage_info, I am not entirely against
> > it. If you convince Marcelo that clearing lpage_info like that is a good
> > idea, I may reconsider. I think that, regardless of how lpage_info is
> > tracked, having a way to find all shadow pages that reference a memslot is
> > a good thing.
> >
> > > >
> > > >> Another point is that lpage_info stops the mmu from using large pages. If
> > > >> we do not reset lpage_info, the mmu keeps using 4K pages until the
> > > >> invalid sp is zapped.
> > > >>
> > > > I do not think this is a big issue. If lpage_info prevented the use of
> > > > large pages for some memory ranges before we zapped the shadow pages,
> > > > it was probably for a reason, so a new shadow page will prevent large
> > > > pages from being created for the same memory ranges.
> > >
> > > I am still worried, but I will try it if Marcelo does not object.
> > > Thanks a lot for your valuable suggestion, Gleb!
> > >
> > > Now I am trying my best to grasp Marcelo's idea of "zapping root
> > > pages", but......
> > >
> > Yes, I am missing what Marcelo means there too. We cannot free the memslot
> > until we unmap its rmap one way or the other.
>
> I do not understand what you are optimizing for, given the four possible
> cases we discussed at
>
> https://lkml.org/lkml/2013/4/18/280
>
We are optimizing mmu_lock holding time for all of those cases.

But you cannot just do "zap roots + sp gen number increase" on slot
deletion, because you need to transfer the access/dirty information from the
rmap that is going to be deleted to the actual pages before
kvm_set_memory_region() returns to its caller.
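
Concretely, dropping each spte has to push its accumulated accessed/dirty
bits back to the backing page before the slot (and its rmap) can go away,
roughly along these lines (simplified sketch; in mmu.c the real work happens
when the spte is cleared):

static void transfer_spte_state(u64 spte)
{
	pfn_t pfn = spte_to_pfn(spte);

	/* Propagate A/D state from the spte to the page itself. */
	if (spte & shadow_accessed_mask)
		kvm_set_pfn_accessed(pfn);
	if (spte & shadow_dirty_mask)
		kvm_set_pfn_dirty(pfn);
}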

> That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
With a lock break? It is. We tried to optimize that by zapping only the pages
that reference the memslot that is going to be deleted and zapping all the
others later when recycling old sps, but if you think this is premature
optimization, I am fine with it.

--
Gleb.