Re: [patch 1/2] fork_connector: add a fork connector

From: Evgeniy Polyakov
Date: Tue Mar 29 2005 - 02:11:42 EST


On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > The lmbench shows that the overhead (the construction and the sending
> > of the message) in the fork() routine is around 7%.
>
> Thanks for including the numbers. The 7% seems a bit costly, for a bit
> more accounting information. Perhaps dean's suggestion, to not use
> ascii, will help. I hope so, though I doubt it will make a huge
> difference. Was this 7% loss with or without a user level program
> consuming the sent messages? I would think that the number of interest
> would include a minimal consumer task.

There is no overhead at all when using CBUS.
On my old P2/256 MB SMP machine it took about 950 usec
to create and exit a process, both with the fork connector turned on and
with it not even compiled in.
A direct connector method call took about 1000-1100 usec.
The current fork connector does not use CBUS [yet, I hope].
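
To make the comparison concrete, the direct (non-CBUS, binary rather than
ascii) path looks roughly like this. This is only a sketch: the
CN_IDX_FORK/CN_VAL_FORK ids and the payload layout are made up for
illustration, and the cn_netlink_send() signature is assumed from the
connector patch, not copied from the submitted fork connector.

        #include <linux/kernel.h>
        #include <linux/string.h>
        #include <linux/slab.h>
        #include <linux/connector.h>

        #define CN_IDX_FORK     0xf     /* hypothetical connector index */
        #define CN_VAL_FORK     0x1     /* hypothetical connector value */

        struct fork_event {
                __u32 parent_pid;
                __u32 child_pid;
        };

        void fork_connector(pid_t parent, pid_t child)
        {
                struct cn_msg *msg;
                struct fork_event *ev;

                msg = kmalloc(sizeof(*msg) + sizeof(*ev), GFP_KERNEL);
                if (!msg)
                        return;
                memset(msg, 0, sizeof(*msg));

                msg->id.idx = CN_IDX_FORK;
                msg->id.val = CN_VAL_FORK;
                msg->len = sizeof(*ev);

                ev = (struct fork_event *)msg->data;
                ev->parent_pid = parent;
                ev->child_pid = child;

                /* broadcast to whoever is subscribed; dropped if nobody is */
                cn_netlink_send(msg, CN_IDX_FORK, GFP_KERNEL);

                kfree(msg);
        }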

> I don't see a good reason to make fork_connector() inline. Since it
> calls other subroutines and is not just a few lines, perhaps better to
> make it a real routine, so we can see it in "nm --print-size" output and
> debug stacks.
>
> Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
> seems unfortunate. Can the real fork_connector() be put elsewhere, and
> the ifdef put in a header file that makes it a no-op if not configured,
> or simply a function declaration, if configured?
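
Something along these lines would keep the #ifdef out of fork.c (a sketch
only; the header name is made up, the config symbol follows the patch under
discussion):

        /* include/linux/fork_connector.h (hypothetical header name) */
        #ifndef _LINUX_FORK_CONNECTOR_H
        #define _LINUX_FORK_CONNECTOR_H

        #include <linux/types.h>

        #ifdef CONFIG_FORK_CONNECTOR
        extern void fork_connector(pid_t parent, pid_t child);  /* real routine, visible to nm */
        #else
        static inline void fork_connector(pid_t parent, pid_t child) {}  /* compiles away */
        #endif

        #endif /* _LINUX_FORK_CONNECTOR_H */

fork.c would then call fork_connector() unconditionally.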
>
> What's the status of the connector driver patch? I perhaps wasn't
> paying close enough attention, but all I see of it now is a couple of
> patches sent to lkml, from Evgeniy Polyakov, in September and January.
> I don't see it in my copies of *-mm or recent Linus bk trees. Am I
> missing something?

It was dropped from the -mm tree, since the bk tree where it lives
was in maintenance mode.
I think the connector will appear in the next -mm release.

> This still seems to me like more apparatus than is desirable, just to
> get another form of session id, as best as I can figure it. However
> we've already been there, and apparently my concerns were not
> persuasive. If one does go down this path, then using this connector
> patch is as good an alternative as any I know of. Well, that or relayfs.
> My uneducated assumption is that relayfs might at least batch data
> packets up into big buffer chunks better, but someone more knowledgeable
> than me needs to consider that.
>
> It's a little sad, when almost all the required accounting information
> comes out in packed 64 byte records, carefully buffered and sent in
> big chunks, to minimize per-task costs. Then this one extra detail,
> of <parent-pid, child-pid> requires an entire netlink packet of
> its own of what size -- another 50 or 100 bytes? Is this packet
> received as a separate data packet, on its own recv(2) system call,
> by the user task, not in a big block of packets? The efficiency
> of getting this one extra <parent-pid, child-pid> out of the kernel
> seems to be one or two orders of magnitude worse than the rest of
> the accounting data.

That can easily be changed.
One may send the kernel/acct.c acct_t structure out of the kernel instead -
the overhead will be the same: kmalloc will probably get the new area from
the same 256-byte pool, and the skb is still in cache.
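
In other words, something like the following (again only a sketch, reusing
the hypothetical CN_IDX_FORK/CN_VAL_FORK ids from the fragment above; the
acct_t record itself would be filled the same way kernel/acct.c already
fills it):

        #include <linux/acct.h>
        #include <linux/string.h>
        #include <linux/slab.h>
        #include <linux/connector.h>

        static void acct_connector_send(acct_t *ac)
        {
                struct cn_msg *msg;

                /* sizeof(*msg) plus the 64-byte acct_t still comes out of
                 * the same kmalloc size pool as the pid-pair message */
                msg = kmalloc(sizeof(*msg) + sizeof(*ac), GFP_KERNEL);
                if (!msg)
                        return;
                memset(msg, 0, sizeof(*msg));

                msg->id.idx = CN_IDX_FORK;      /* hypothetical id, see above */
                msg->id.val = CN_VAL_FORK;
                msg->len = sizeof(*ac);
                memcpy(msg->data, ac, sizeof(*ac));

                cn_netlink_send(msg, CN_IDX_FORK, GFP_KERNEL);
                kfree(msg);
        }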

> ===
>
> Hmmm ... perhaps one could add a _second_ accounting file, cutting and
> pasting code in kernel/acct.c and enabling writing additional
> information to that second file, using the same mechanisms as now used
> for the primary file. Use a more extensible record format for the
> second file (say start each record with a magic cookie, a byte record
> type and a byte record length, then that many bytes). That way, we have
> an escape valve for adding additional record types in the future.
> And that way we can efficiently write short records, with just say
> a couple of interesting values, and minimal overhead.
>
> Don't worry if the magic cookie appears as part of the raw data. If one
> has to resync such a data stream, one can look for a series of records,
> each starting with the magic cookie, sensible record type byte, and a
> length that ends right at the next such valid record. The occasional
> duplication of the same cookie within the data stream would not thwart a
> resync for long. And the main purpose of the magic cookie is to make
> sure you are still in sync, not reverting to garbage-in, garbage-out,
> mode. Almost any magic value other than 0x0000 will suffice for that
> purpose.
>
> I just ran a silly little test on my PC desktop Linux box, scanning
> /proc/kcore. The _least_ common 2 byte word seen was 0x2B91, with 31
> instances in a half-billion words scanned, so I nominate that value for
> the magic cookie ;).
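
Spelled out, such a record header could be as small as four bytes. A sketch
only, using the nominated 0x2B91 cookie; the ACCT2_* names and field names
are illustrative, not an existing format:

        #include <linux/types.h>

        #define ACCT2_MAGIC     0x2B91  /* resync marker, not a guarantee */

        struct acct2_rec_hdr {
                __u16   magic;          /* ACCT2_MAGIC */
                __u8    type;           /* record type, so new kinds can be added later */
                __u8    len;            /* number of payload bytes that follow */
        } __attribute__((packed));

        /* e.g. a <parent-pid, child-pid> record would be just 4 + 8 bytes: */
        struct acct2_fork_rec {
                struct acct2_rec_hdr    hdr;    /* { ACCT2_MAGIC, <fork type>, 8 } */
                __u32                   parent_pid;
                __u32                   child_pid;
        } __attribute__((packed));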
>
> The key reason that it might make sense here to adapt the existing
> accounting file direct write mechanism, rather than using "connector" or
> "relayfs", is that we really do want to get this data to disk initially.
> Relayfs is optimized for getting a lot of data to a user daemon, and the
> connector for sending smaller packets of data to a user daemon. But
> accounting processing is sometimes done out of a cron job off-hours.
> During the day (the busy hours) you might just want to stash the stuff
> with as little performance impact as possible. If one can avoid _any_
> other task having to context switch in, in order to get this data on its
> way, that is a huge win.

File-based accounting [kernel/acct.c] is slower: it takes global locks
and requires process context to work with system calls.
relayfs is an interesting project, but it has different aims,
as far as I can see - it was created for transferring huge amounts of
data, and it succeeds at that, while the connector is purely a
control/notification mechanism, for example for gathering short-lived
per-process accounting data.
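
For what it's worth, the userspace side of such a notification channel -
the kind of minimal consumer task Paul asks about above - is tiny. A rough
sketch only: NETLINK_CONNECTOR and struct cn_msg come from the connector
patch, while CN_IDX_FORK and the payload layout are the same assumptions
as in the kernel-side fragment above.

        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <linux/netlink.h>
        #include <linux/connector.h>

        #define CN_IDX_FORK 0xf                 /* must match the kernel side */

        int main(void)
        {
                char buf[4096];
                struct sockaddr_nl addr;
                struct nlmsghdr *nlh;
                int s, len;

                s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
                if (s < 0)
                        return 1;

                memset(&addr, 0, sizeof(addr));
                addr.nl_family = AF_NETLINK;
                addr.nl_groups = CN_IDX_FORK;   /* subscribe to the fork group */
                if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        return 1;

                for (;;) {
                        len = recv(s, buf, sizeof(buf), 0);
                        if (len <= 0)
                                break;
                        for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
                             nlh = NLMSG_NEXT(nlh, len)) {
                                struct cn_msg *msg = NLMSG_DATA(nlh);
                                /* msg->data carries the <parent-pid, child-pid> payload */
                                printf("fork event, %u payload bytes\n",
                                       (unsigned)msg->len);
                        }
                }
                return 0;
        }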

--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
