RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

From: Xin, Xiaohui
Date: Thu Sep 23 2010 - 08:56:45 EST


>-----Original Message-----
>From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
>Sent: Wednesday, September 22, 2010 7:55 PM
>To: Xin, Xiaohui
>Cc: netdev@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>mingo@xxxxxxx; davem@xxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxxx;
>jdike@xxxxxxxxxxxxxxx
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Wed, Sep 22, 2010 at 07:41:36PM +0800, Xin, Xiaohui wrote:
>> >-----Original Message-----
>> >From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
>> >Sent: Tuesday, September 21, 2010 9:14 PM
>> >To: Xin, Xiaohui
>> >Cc: netdev@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>> >mingo@xxxxxxx; davem@xxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxxx;
>> >jdike@xxxxxxxxxxxxxxx
>> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >
>> >On Tue, Sep 21, 2010 at 09:39:31AM +0800, Xin, Xiaohui wrote:
>> >> >From: Michael S. Tsirkin [mailto:mst@xxxxxxxxxx]
>> >> >Sent: Monday, September 20, 2010 7:37 PM
>> >> >To: Xin, Xiaohui
>> >> >Cc: netdev@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>> >> >mingo@xxxxxxx; davem@xxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxxx;
>> >> >jdike@xxxxxxxxxxxxxxx
>> >> >Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>> >> >
>> >> >On Mon, Sep 20, 2010 at 04:08:48PM +0800, xiaohui.xin@xxxxxxxxx wrote:
>> >> >> From: Xin Xiaohui <xiaohui.xin@xxxxxxxxx>
>> >> >>
>> >> >> ---
>> >> >> Michael,
>> >> >> I have move the ioctl to configure the locked memory to vhost
>> >> >
>> >> >It's ok to move this to vhost but vhost does not
>> >> >know how much memory is needed by the backend.
>> >>
>> >> I think the backend here you mean is mp device.
>> >> Actually, the memory needed is related to vq->num to run zero-copy
>> >> smoothly.
>> >> That means mp device did not know it but vhost did.
>> >
>> >Well, this might be so if you insist on locking
>> >all posted buffers immediately. However, let's assume I have a
>> >very large ring and prepost a ton of RX buffers:
>> >there's no need to lock all of them directly:
>> >
>> >if we have buffers A and B, we can lock A, pass it
>> >to hardware, and when A is consumed unlock A, lock B
>> >and pass it to hardware.
>> >
>> >
>> >It's not really critical. But note we can always have userspace
>> >tell MP device all it wants to know, after all.
>> >
>> Ok. Here are two values we have mentioned, one is how much memory
>> user application wants to lock, and one is how much memory locked
>> is needed to run smoothly. When net backend is setup, we first need
>> an ioctl to get how much memory is needed to lock, and then we call
>> another ioctl to set how much it want to lock. Is that what's in your mind?
>
>That's fine.
>
>> >> And the rlimt stuff is per process, we use current pointer to set
>> >> and check the rlimit, the operations should be in the same process.
>> >
>> >Well no, the ring is handled from the kernel thread: we switch the mm to
>> >point to the owner task so copy from/to user and friends work, but you
>> >can't access the rlimit etc.
>> >
>> Yes, the userspace and vhost kernel is not the same process. But we can
>> record the task pointer as mm.
>
>So you will have to store mm and do device->mm, not current->mm.
>Anyway, better not touch mm on data path.
>
>> >> Now the check operations are in vhost process, as mp_recvmsg() or
>> >> mp_sendmsg() are called by vhost.
>> >
>> >Hmm, what do you mean by the check operations?
>> >send/recv are data path operations, they shouldn't
>> >do any checks, should they?
>> >
>> As you mentioned what infiniband driver done:
>> down_write(&current->mm->mmap_sem);
>>
>> locked = npages + current->mm->locked_vm;
>> lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>
>> if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> ret = -ENOMEM;
>> goto out;
>> }
>>
>> cur_base = addr & PAGE_MASK;
>>
>> ret = 0;
>> while (npages) {
>> ret = get_user_pages(current, current->mm, cur_base,
>> min_t(unsigned long, npages,
>> PAGE_SIZE / sizeof (struct page *)),
>> 1, !umem->writable, page_list, vma_list);
>>
>> I think it's a data path too.
>
>in infiniband this is used to 'register memory' which is not data path.
>
>> We do the check because get_user_pages() really pin and locked
>> the memory.
>
>Don't do this. Performance will be bad.
>Do the check once in ioctl and increment locked_vm by max amount you will use.
>On data path just make sure you do not exceed what userspace told you
>to.

What's in my mind is that in the ioctl which to get the memory locked needed to run smoothly,
it just return a value of how much memory is needed by mp device.
And then in the ioctl which to set the memory locked by user space, it check the rlimit and
increment locked_vm by user want. But I'm not sure how to "make sure do not exceed what
userspace told to". If we don't check locked_vm, what do we use to check? And Is it another kind of check on data path?

>
>>
>> >> So set operations should be in
>> >> vhost process too, it's natural.
>> >>
>> >> >So I think we'll need another ioctl in the backend
>> >> >to tell userspace how much memory is needed?
>> >> >
>> >> Except vhost tells it to mp device, mp did not know
>> >> how much memory is needed to run zero-copy smoothly.
>> >> Is userspace interested about the memory mp is needed?
>> >
>> >Couldn't parse this last question.
>> >I think userspace generally does want control over
>> >how much memory we'll lock. We should not just lock
>> >as much as we can.
>> >
>> >--
>> >MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/