Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Peter Zijlstra
Date: Thu Aug 17 2006 - 15:14:55 EST

Next message: Paul Jackson: "Re: [patch] sched: group CPU power setup cleanup"
Previous message: john stultz: "Re: Linux time code"
In reply to: Evgeniy Polyakov: "Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD"
Next in thread: Evgeniy Polyakov: "Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 2006-08-17 at 22:42 +0400, Evgeniy Polyakov wrote:
> On Thu, Aug 17, 2006 at 11:01:52AM -0700, Daniel Phillips (phillips@xxxxxxxxxx) wrote:
> > *** The system is not OOM, it is in reclaim, a transient condition ***
>
> It does not matter how condition when not every user can get memory is
> called. And actually no one can know in advance how long it will be.

Well, we have a direct influence here, we're working on code-paths that
are called from reclaim. If we stall, the box is dead.

> > Which Peter already told you! Please, let's get our facts straight,
> > then we can look intelligently at whether your work is appropriate to
> > solve the block IO network starvation problem that Peter and I are
> > working on.
>
> I got openssh as example of situation when system does not know in
> advance, what sockets must be marked as critical.
> OpenSSH works with network and unix sockets in parallel, so you need to
> hack openssh code to be able to allow it to use reserve when there is
> not enough memory.

OpenSSH or any other user-space program will never ever have the problem
we are trying to solve. Nor could it be fixed the way we are solving it,
user-space programs can be stalled indefinite. We are concerned with
kernel services, and the continued well being of the kernel, not
user-space. (well therefore indirectly also user-space of course)

> Actually all sockets must be able to get data, since
> kernel can not diffirentiate between telnetd and a process which will
> receive an ack for swapped page or other significant information.

Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
and AoE are all kernel services, not user-space. And it is the core of
our work to provide this information to the kernel; to distinguish these
few critical sockets.

> So network must behave separately from main allocator in that period of
> time, but since it is possible that reserve can be not filled or has not
> enough space or something other, it must be preallocated in far advance
> and should be quite big, but then why netwrok should use it at all, when
> being separated from main allocations solves the problem?

You still need to guarantee data-paths to these services, and you need
to make absolutely sure that your last bit of memory is used to service
these critical sockets, not some random blocked user-space process.

You cannot pre-allocate enough memory _ever_ to satisfy the total
capacity of the network stack. You _can_ allot a certain amount of
memory to the network stack (avoiding DoS), and drop packets once you
exceed that. But still, you need to make sure these few critical
_kernel_ services get their data.

> I do not argue that your approach is bad or does not solve the problem,
> I'm just trying to show that further evolution of that idea eventually
> ends up in separated allocator (as long as all most robust systems
> separate operations), which can improve things in a lot of other sides
> too.

Not a separate allocator per-se, separate socket group, they are
serviced by the kernel, they will never refuse to process data, and it
is critical for the continued well-being of your kernel that they get
their data.

Also, I do not think people would like it if say 100M of their 1G system
just disappears, never to used again for eg. page-cache in periods of
low network traffic.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Paul Jackson: "Re: [patch] sched: group CPU power setup cleanup"
Previous message: john stultz: "Re: Linux time code"
In reply to: Evgeniy Polyakov: "Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD"
Next in thread: Evgeniy Polyakov: "Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]