Re: [patch 05/11] syslets: core code

From: Zach Brown
Date: Thu Feb 15 2007 - 14:23:08 EST


2) On the client facing side (port 53), I'd very much hope for a way to
do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
UDP datagrams with only one kernel transition.

I want to highlight this point that Bert is making.

Whenever we talk about AIO and kernel threads some folks are rightly concerned that we're talking about a thread *per IO* and fear that memory consumption will be fatal.

Take the case of userspace which implements what we'd think of as page cache writeback. (*coughs, points at email address*). It wants to issue thousands of IOs to disjoint regions of a file. "Thousands of kernel threads, oh crap!"

But it only issues each IO with a separate syscall (or io_submit() op) because it doesn't have an interface that lets it specify IOs that vector user memory addresses *and file position*.

If we had a seemingly obvious interface that let it kick off batched IOs to different parts of the file, the looming disaster of a thread per IO vanishes in that case.

struct off_vec {
off_t pos;
size_t len;
};

long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
struct off_vec *ovec, size_t ov_count);

It doesn't take long to imagine other uses for this that are less exotic.

Take e2fsck and its iterating through indirect blocks or directory data blocks. It has a list of disjoint file regions (blocks) it wants to read, but it does them serially to keep the code from getting even more confusing. blktrace a clean e2fsck -f some time.. it's leaving *HALF* of the disk read bandwith on the table by performing serial block-sized reads. If it could specify batches of them the code would still be simple but it could tell the kernel and IO scheduler *exactly* what it wants, without having to mess around with sys_readahead() or AIO or any of that junk :).

Anyway, that's just something that's been on my mind. If there are obvious clean opportunities to get more done with single syscalls, it might not be such a bad thing.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/