Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe

From: Subhra Mazumdar
Date: Mon Nov 05 2018 - 18:41:00 EST

Next message: Tim Chen: "Re: [Patch v4 17/18] x86/speculation: Update SPEC_CTRL MSRs of remote CPUs"
Previous message: Lu Baolu: "Re: [PATCH v4 6/8] vfio/mdev: Add iommu place holders in mdev_device"
In reply to: Mel Gorman: "Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe"
Next in thread: Mel Gorman: "Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 11/5/18 2:08 AM, Mel Gorman wrote:

Adding Al Viro as per get_maintainers.pl.

On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:

Introduce pipe_ll_usec field for pipes that indicates the amount of micro
seconds a thread should spin if pipe is empty or full before sleeping. This
is similar to network sockets.

Can you point to what pattern from network sockets you are duplicated
exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
but it should be in the changelog because there are differences in polling
depending on where you are in the network subsystem (which I'm not very
familiar with).

I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By
similar I meant having a similar mechanism for pipes to busy wait

Workloads like hackbench in pipe mode
benefits significantly from this by avoiding the sleep and wakeup overhead.
Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
to enable or disable busy waiting via /proc. The value of it specifies the
amount of spin in microseconds. Default value is 0 indicating no spin.

Your lead mail indicates the spin was set to a "suitable spin time". How
should an administrator select this spin time? What works for hackbench
might not be suitable for another workload. What if the spin time
selected happens to be just slightly longer than the time it takes the
reader to respond? In such a case, the result would be "all spin, no
gain". While networking potentially suffers the same problem, it appears
to be opt-in per socket so it's up to the application not to shoot
itself in the foot.

Even for network, sk_ll_usec is assigned the value of the tunable
sysctl_net_busy_read in sock_init_data() for all sockets initialized by
default. There is way for per socket setting using sock_setsockopt(), for
pipes that can be added later if needed in case of different apps running
in one system. But there are cases where only one app runs (e.g big DBs)
and one tunable will suffice. It can be set to a value that is tested to be
beneficial under the operating conditions.

Signed-off-by: subhra mazumdar <subhra.mazumdar@xxxxxxxxxx>
---
fs/pipe.c | 12 ++++++++++++
include/linux/pipe_fs_i.h | 2 ++
kernel/sysctl.c | 7 +++++++
3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
#include <linux/uaccess.h>
#include <asm/ioctls.h>
+#include <linux/sched/clock.h>
#include "internal.h"
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
*/
unsigned long pipe_user_pages_hard;
unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
/*
* We use a start+len construction, which provides full use of the
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
void pipe_wait(struct pipe_inode_info *pipe)
{
DEFINE_WAIT(wait);
+ u64 start;
/*
* Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
*/
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
+ start = local_clock();
+ while (current->state != TASK_RUNNING &&
+ ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+ cpu_relax();
schedule();
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);

Networking breaks this out better in terms of options instead of
hard-coding. This does not handle need_resched or signal delivery
properly where as networking does for example.

I don't disable preemption, so don't think checking need_resched is needed.
Can you point to what you mean by handling signal delivery in case of
networking? Not sure what I am missing. My initial version broke it out
like networking but after Peter's suggestion I clubbed it. I don't feel
strongly either way.

@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
struct file *files[2];
int fd[2];
int error;
+ struct pipe_inode_info *pipe;
error = __do_pipe_flags(fd, files, flags);
if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
}
+ pipe = files[0]->private_data;
+ pipe->pipe_ll_usec = pipe_busy_poll;
+ pipe = files[1]->private_data;
+ pipe->pipe_ll_usec = pipe_busy_poll;
}
return error;
}

You add a pipe field but the value in always based on the sysctl
so the information is redundantu (barring a race condition on one
pipe write per sysctl update which is an irrelevant corner case). In
comparison, the network subsystem appears to be explicitly opt-in via
setsockopt(SO_BUSY_POLL) from a glance and the value appears to be tunable
on a per-socket basis (I didn't check for sure, this is based on a glance
at what networking does). It's not clear what a sensible way of replicating
that for pipe file descriptors would be but it's possible that the only
way would be to define the system-wide tunable as a max spin time and try
detect the optimal spin time on a per-fd basis (not entirely dissimilar
to how cpuidle menu guesses how long it'll be in an idle state for).

As I said above even networking assigns the global tunable value to each
socket during initialization. Trying to detect an optimal spin time will
be extremely difficult, I had tried a lot of heuristics but doesn't quite
work under all workloads/utilization level. This change also doesn't
preclude such work in future. In mean time some workloads can benefit from
straight forward constant tunable setting.

It's not really my area but I feel that this patch is a benchmark-specific
hack and that tuning it on a system-wide basis will be a game of "win
some, lose some" that is never used in practice. Worse, it might end up
in a tuning guide as "always set this sysctl" without considering the
capabilities of the machine or the workload and falls victim to cargo
cult tuning.

It is 0 by default so no spin. Workloads that do set this should know what
they are doing and if it's beneficial. I will try DB OLTP workload to see
any benefits.

Thanks,
Subhra

Next message: Tim Chen: "Re: [Patch v4 17/18] x86/speculation: Update SPEC_CTRL MSRs of remote CPUs"
Previous message: Lu Baolu: "Re: [PATCH v4 6/8] vfio/mdev: Add iommu place holders in mdev_device"
In reply to: Mel Gorman: "Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe"
Next in thread: Mel Gorman: "Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]