Re: [GIT PULL] ocfs2 changes for 2.6.32

From: Linus Torvalds
Date: Thu Sep 17 2009 - 12:29:29 EST




On Tue, 15 Sep 2009, Joel Becker wrote:
>
> Ok. Where do you see the exposure level? What I mean is, I
> just defined a vfs op that handles these things, but accessed it via two
> syscalls, sys_snapfile() and sys_copyfile(). We could also just provide
> one system call and allow userspace to use these flags itself, creating
> snapfile(3) and copyfile(3) in libc

Why would anybody want to hide it at all? Why even the libc hiding?

Nobody is going to use this except for special apps. Let them see what
they can do, in all its glory.

> > I still worry that especially the non-atomic case will want some kind of
> > partial-copy updates (think graphical file managers that want to show the
> > progress of the copy), and that (think EINTR and continuing) makes me
> > think "that could get really complex really quickly", but that's something
> > that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS
> > spec has some kind of "partial completion notification" model, I dunno about
> > SMB.
>
> I'm really wary of combining a ranged interface with this one.
> Not only does it make no sense for snapshots, but I think it falls down
> in any "create a new inode" scheme entirely.

Oh, I wouldn't suggest a ranged interface, just one that allows for status
updates and cancelling - _if_ the initial op isn't atomic to begin with.
There's also the issue of concurrency in IO: maybe you want to start
several things without necessarily waiting for them (think high-throughput
"cp -R" on NFS or something like that).

So I'd suggest something like having two system calls: one to start the
operation, and one to control it. And for a filesystem that does atomic
copies, the 'start' one obviously would also finish it, so the 'control'
one would be a no-op, because there would never be any outstanding ones.

See what I'm saying? It wouldn't complicate _your_ life, but it would
allow for filesystems that can't do it atomically (or even quickly).

So the first one would be something like

	int copyfile(const char *src, const char *dest, unsigned long flags);

which would return:

 - zero on success
 - negative (with errno) on error
 - positive cookie on "I started it, here's my cookie". For extra bonus
   points, maybe the cookie would actually be a file descriptor (for
   poll/select users), but it would _not_ be a file descriptor to the
   resulting _file_, it would literally be a "cookie" to the actual
   copyfile event.

and then for ocfs2 you'd never return positive cookies. You'd never have
to worry about it.

Then the second interface would be something like

	int copyfile_ctrl(long cookie, unsigned long cmd);

where you'd just have some way to wait for completion and ask how much has
been copied. The 'cmd' would be some set of 'cancel', 'status' or
'uninterruptible wait' or whatever, and the return value would again be

 - negative (with errno) for errors (copy failed) - cookie released
 - zero for 'done' - cookie released
 - positive for 'percent remaining' or whatever - cookie still valid

and this would be another callback into the filesystem code, but you'd
never have to worry about it, since you'd never see it (just leave it
NULL).
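
To make the cookie lifetime concrete, here is a minimal user-space sketch of the return convention described above. Nothing in it is a real kernel interface: the mock_ functions, the COPYFILE_* command values, the fixed table of slots and the "25% per wait" progress are all assumptions made up for illustration. The only thing taken from the proposal is the rule that a positive return keeps the cookie valid while zero or negative releases it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical command values; the proposal only names the concepts. */
enum { COPYFILE_STATUS, COPYFILE_WAIT, COPYFILE_CANCEL };

#define MAX_COOKIES 8

/* Fake per-copy state: percent still to copy, 0 means the slot is free. */
static int remaining[MAX_COOKIES + 1];

/* Mock "start": pretend every copy is asynchronous and hand out a cookie. */
static int mock_copyfile_start(void)
{
	for (int c = 1; c <= MAX_COOKIES; c++) {
		if (remaining[c] == 0) {
			remaining[c] = 100;
			return c;
		}
	}
	errno = EAGAIN;
	return -1;
}

/*
 * Mock "control", following the return convention from the text:
 *   < 0  error or cancelled, cookie released
 *   0    done, cookie released
 *   > 0  percent remaining, cookie still valid
 */
static int mock_copyfile_ctrl(int cookie, unsigned long cmd)
{
	if (cookie < 1 || cookie > MAX_COOKIES || remaining[cookie] == 0) {
		errno = EBADF;		/* unknown or already released cookie */
		return -1;
	}
	switch (cmd) {
	case COPYFILE_STATUS:
		return remaining[cookie];
	case COPYFILE_WAIT:
		/* Pretend each wait lets another 25% of the data through. */
		remaining[cookie] -= 25;
		return remaining[cookie];	/* hitting 0 frees the slot */
	case COPYFILE_CANCEL:
		remaining[cookie] = 0;
		errno = ECANCELED;
		return -1;
	}
	errno = EINVAL;
	return -1;
}

/* Wait for one copy while reporting progress; returns 0 or negative. */
static int wait_with_progress(int cookie)
{
	int left;

	assert(cookie > 0);
	while ((left = mock_copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
		printf("cookie %d: %d%% remaining\n", cookie, left);
	return left;
}
```

A graphical file manager would presumably use a 'status' style command from its event loop instead of blocking in a wait, and a 'cancel' when the user gives up. Either way the caller never has to free the cookie explicitly, because the first non-positive return already did that.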

NOTE! The above is a rough idea - I have not spent tons of time thinking
about it, or looking at exactly what something like NFS would really want.
But the _concept_ is simple, and usage should be pretty trivial. A simple
case would be something like this:

	int copy_file(const char *src, const char *dst)
	{
		/* Start a file copy */
		int cookie = copyfile(src, dst, 0);

		/* Async case? */
		if (cookie > 0) {
			int ret;

			while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
				/* nothing */;

			/* Error handling is shared for async/sync */
			cookie = ret;
		}
		if (cookie < 0) {
			perror("copyfile failed");
			return -1;
		}
		return 0;
	}

doesn't that look fairly easy to use?

And the advantage here is that you _can_ - still fairly easily - do much
more involved things. For example, let's say that you wanted to do a very
efficient parallel copy, so you'd do something like this:

	#include <string.h>	/* memmove */

	#define MAX_PEND 10
	static int pending[MAX_PEND];
	static int nr_pending = 0;

	static int wait_for_completion(int nr_left)
	{
		int ret = 0;

		while (nr_pending > nr_left) {
			int cookie = pending[0], i;

			/* Wait for completion of the oldest entry */
			while ((i = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
				/* nothing */;

			/* Save the "we had an error" case */
			if (i < 0)
				ret = i;

			/* Move the other entries down */
			memmove(pending, pending+1, sizeof(int)*--nr_pending);
		}
		return ret;
	}

	int start_copy(const char *src, const char *dst)
	{
		int cookie, ret;

		cookie = copyfile(src, dst, 0);
		if (cookie <= 0)
			return cookie;

		ret = 0;
		if (nr_pending == MAX_PEND)
			ret = wait_for_completion(MAX_PEND/2);

		pending[nr_pending++] = cookie;
		return ret;
	}

	int stop_copy(void)
	{
		return wait_for_completion(0);
	}

which basically ends up having ten copyfile() calls outstanding (and when
we hit the limit, we wait for half of them to complete), so now you can do
an efficient "cp -R" with concurrent server-side IO. And it wasn't so
hard, was it?

(Ok, so the above would need to be fleshed out to remember the filenames
so that you can report _which_ file failed etc, but you get the idea).
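
As a rough sketch of that fleshing out, the pending array can hold a small struct instead of a bare cookie, so the source name is still around when the wait reports a failure. The mock_copyfile() and mock_copyfile_ctrl() stand-ins, the COPYFILE_WAIT value, the fail_cookie knob and the 256-byte name buffer are all invented here so the example is self-contained; they are not part of any real interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COPYFILE_WAIT 1		/* hypothetical command value */
#define MAX_PEND 10

/* --- Stand-ins for the proposed calls, for illustration only --- */
static int next_cookie = 1;
static int fail_cookie = -1;	/* mock knob: this cookie's copy fails */

static int mock_copyfile(const char *src, const char *dst, unsigned long flags)
{
	(void)src; (void)dst; (void)flags;
	return next_cookie++;	/* always "I started it, here's my cookie" */
}

static int mock_copyfile_ctrl(long cookie, unsigned long cmd)
{
	(void)cmd;
	return cookie == fail_cookie ? -1 : 0;
}

/* One outstanding copy: the cookie plus the name to report on failure. */
struct pending_copy {
	int cookie;
	char src[256];	/* copied, so the caller may reuse its buffer */
};

static struct pending_copy pending[MAX_PEND];
static int nr_pending;
static int nr_failed;

static int wait_for_completion(int nr_left)
{
	int ret = 0;

	while (nr_pending > nr_left) {
		int i;

		/* Wait for completion of the oldest entry */
		while ((i = mock_copyfile_ctrl(pending[0].cookie, COPYFILE_WAIT)) > 0)
			/* nothing */;

		/* Now we can say _which_ file failed */
		if (i < 0) {
			fprintf(stderr, "copy of %s failed\n", pending[0].src);
			nr_failed++;
			ret = i;
		}

		/* Move the other entries down */
		nr_pending--;
		memmove(pending, pending + 1, sizeof(pending[0]) * nr_pending);
	}
	return ret;
}

static int start_copy(const char *src, const char *dst)
{
	int cookie, ret = 0;

	cookie = mock_copyfile(src, dst, 0);
	if (cookie <= 0)
		return cookie;

	if (nr_pending == MAX_PEND)
		ret = wait_for_completion(MAX_PEND / 2);

	assert(nr_pending < MAX_PEND);
	pending[nr_pending].cookie = cookie;
	snprintf(pending[nr_pending].src, sizeof(pending[nr_pending].src), "%s", src);
	nr_pending++;
	return ret;
}

static int stop_copy(void)
{
	return wait_for_completion(0);
}
```

Note that a negative return from start_copy() here can mean "an earlier file failed while I was making room", which is exactly why the name has to travel with the cookie rather than being inferred from the current call.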

And again, it wouldn't be any more complicated for your case. Your
copyfile would always just return 0 or negative for error. But it would be
_way_ more powerful for filesystems that want to do potentially lots of IO
for the file copy.

I dunno. The above seems like a fairly simple and powerful interface, and
I _think_ it would be ok for NFS and CIFS. And in fact, if that whole
"background copy" ends up being used a lot, maybe even a local filesystem
would implement it just to get easy overlapping IO - even if it would just
be a trivial common wrapper function that says "start a thread to do a
trivial manual copy".
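
For what that trivial wrapper might look like, here is a user-space analogy using POSIX threads. It is only a sketch under assumptions: bg_copy_start() and bg_copy_wait() are hypothetical names, the "cookie" is a heap pointer rather than a small integer or file descriptor, and there is no status or cancel support, only a blocking wait. A real in-kernel version would look quite different.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "cookie": everything the background copy needs. */
struct bg_copy {
	pthread_t thread;
	char src[1024], dst[1024];
	int result;		/* 0 on success, -1 on failure */
};

/* Thread body: the "trivial manual copy", one buffer at a time. */
static void *bg_copy_thread(void *arg)
{
	struct bg_copy *bc = arg;
	char buf[16384];
	size_t n;
	FILE *in = fopen(bc->src, "rb");
	FILE *out = in ? fopen(bc->dst, "wb") : NULL;

	bc->result = -1;
	if (in && out) {
		bc->result = 0;
		while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
			if (fwrite(buf, 1, n, out) != n) {
				bc->result = -1;
				break;
			}
		}
		if (ferror(in))
			bc->result = -1;
	}
	/* A failed close can hide a write error, so check it */
	if (out && fclose(out) != 0)
		bc->result = -1;
	if (in)
		fclose(in);
	return NULL;
}

/* Start a copy in the background; returns a cookie or NULL on error. */
static struct bg_copy *bg_copy_start(const char *src, const char *dst)
{
	struct bg_copy *bc;

	if (strlen(src) >= sizeof(bc->src) || strlen(dst) >= sizeof(bc->dst))
		return NULL;
	bc = calloc(1, sizeof(*bc));
	if (!bc)
		return NULL;
	strcpy(bc->src, src);
	strcpy(bc->dst, dst);
	if (pthread_create(&bc->thread, NULL, bg_copy_thread, bc) != 0) {
		free(bc);
		return NULL;
	}
	return bc;
}

/* Block until the copy is finished, then release the cookie. */
static int bg_copy_wait(struct bg_copy *bc)
{
	int result;

	pthread_join(bc->thread, NULL);
	result = bc->result;	/* safe to read: the thread has exited */
	free(bc);
	return result;
}
```

The point of the exercise is only that the caller's side looks the same as in the examples above: start several copies, keep the cookies, and wait on them later, which gives overlapping IO without the caller managing threads itself.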

Linus