Re: BK kernel workflow

From: Roman Zippel
Date: Thu Oct 28 2004 - 16:17:59 EST


Hi,

On Wed, 27 Oct 2004, Larry McVoy wrote:

> Warning: long message. Reason you should read it: because I show you
> how you can independently verify, without a BK license, that Roman is
> spreading FUD. So read this, check it out, you'll see that this is
> all Roman's nonsense. If you already believe that, you can skip this
> but there is a lot of useful BK2CVS data in this message.

I have to apologize, there is a little more data in the cvs repository, but
it doesn't change in any way what I was trying to tell Linus.
It seems I made another mistake that I have to apologize for. I didn't
realize that the kernel repository is a private party and you are the bully
at the door. Well, it seems I'm not cool enough and I deeply apologize for
bothering you guys.

All I have left to say is that what I said before is anything but
nonsense, and I invite everyone to verify what I say and whether Larry
really tells the complete story.
So for example the 56% number is very real:

$ rlog ChangeSet,v | grep BKrev: | wc -l
23461

The number of changesets in the bk repository is around 53000; this is
the difference I talked about and it's real. What does this mean? The bk
repository contains around 53000 snapshots of kernel development history,
and only 44% of these snapshots can be restored via bkcvs, so 56% cannot.
So where is the rest?
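
As a quick check of the arithmetic, here is a minimal sketch in Python. It
only uses the two figures quoted in this mail; the 53000 is an approximate
count, so the percentages are approximate as well, and the variable names
are purely illustrative.

```python
# Figures taken from the mail: 23461 BKrev entries in ChangeSet,v and
# roughly 53000 changesets in the bk repository (an approximate number).
cvs_commits = 23461
bk_changesets = 53000  # approximate

restorable = cvs_commits / bk_changesets  # share visible through bkcvs
missing = 1 - restorable                  # share that cannot be restored

print(f"restorable via bkcvs: {restorable:.0%}")
print(f"not restorable:       {missing:.0%}")
```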

bk preserves the history nonlinearly, e.g. something like this:

t1 -> t2 -> t3 -> t4 -> t5
  +---> t1.1 -> t1.2 ---^

This means someone cloned the repo at t1 and committed some changes at
t1.1 and t1.2; in the meantime the changes t2, t3 and t4 were applied to
the parent repo, and at t5 the changes from the cloned repo were merged.

CVS by itself of course can't represent this history, so it only contains
the changes from t1, t2, t3, t4 and t5. The snapshots t1.1 and t1.2 can no
longer be restored reliably from bkcvs; one can only get the merged
changeset t5. As it's rather likely that every commit changes different
files, one can actually get 100% of the file changes, but in this case you
still have only 71% of the commits (5 out of 7). Why would anyone be
interested in the remaining 29%? Only with the complete information can one
go back to any previous snapshot, so if e.g. something went wrong during
the merge in t5, it's quite easy to tell that it still worked at t4 and
t1.2. Without the extra information one only knows that it went wrong
after t4.
Applied to bkcvs this means a problem can be located with a precision of
44%, and this number is still quite high, as I mentioned in the last mail.
The more bk users there are, the lower this number becomes. Let's say 5
people set up a repo with 20 changes each to let Linus pull from them,
instead of sending them to Andrew, so they can end up as just 24 commits in
bkcvs.
The little grepping I did on ChangeSet,v seems to back up my theory (if
someone knows how to get real numbers, I'd love to hear them). I looked at
pulls from the net, scsi and Greg's repo, and most of them are indeed
merged into a single bkcvs commit.
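
The small example history above (t1 to t5 plus the branch t1.1, t1.2) can
be modelled in a few lines of Python. This is only a sketch under stated
assumptions: the parents table is a hypothetical hand-written model of the
diagram, not anything read from bk or bkcvs, and it assumes that the first
parent of a merge is the mainline one.

```python
# Hypothetical model of the example history: changeset -> list of parents.
# Assumption: the first parent of a merge is the mainline parent.
parents = {
    "t1": [],
    "t2": ["t1"],
    "t3": ["t2"],
    "t4": ["t3"],
    "t1.1": ["t1"],
    "t1.2": ["t1.1"],
    "t5": ["t4", "t1.2"],  # merge of the cloned repo into the parent repo
}

def mainline(head):
    """Follow first parents only: the linear history a CVS-style export keeps."""
    chain = []
    node = head
    while node is not None:
        chain.append(node)
        ps = parents[node]
        node = ps[0] if ps else None
    return chain[::-1]

def last_good_candidates(bad, known):
    """Parents of a bad changeset that are still available as snapshots."""
    return [p for p in parents[bad] if p in known]

linear = mainline("t5")
coverage = len(linear) / len(parents)

print(linear)                                      # commits kept in the linear history
print(f"{coverage:.0%}")                           # share of commits kept
print(last_good_candidates("t5", set(parents)))    # with the full history
print(last_good_candidates("t5", set(linear)))     # with the linear history only
```

With the full history both t4 and t1.2 can be checked when t5 is broken;
with the linear history only t4 is left, which is the loss of precision
described above.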

On the other hand Larry's 96% number sounds impressive, but it's the less
interesting one. If some of the commit information is missing, it's
nontrivial to put the file changes in the correct context. In the example
above this means the individual file patches of t1.1 and t1.2 have to be
matched to the corresponding split changelog of t5. That is anything but
trivial, if it's possible at all, and it doesn't provide the information
that t1.1 has to be applied on top of t1. This means that it's often
possible to extract a changeset manually, but doing it automatically is
rather impractical and not worth the trouble.

Now it's theoretically possible to extract data via bkweb, but I (and I
hope not only I) still remember what happened the last time Andrea tried
that, so I haven't actually looked into it too deeply. It certainly is
possible to extract all commits; the problem starts with putting them into
a context. The repository information in bkweb is a bit filtered (e.g.
empty merges, which are of only little interest to the human viewer), and
that makes it interesting to tell reliably where a branch starts and ends.
Without further research I can only guess whether that information is
deducible or has to be extrapolated.

Finally the commit mails have really only informational value, as they
lack any usable context.

Well, that's about it regarding the usefulness of the publicly available
information. So if anyone could tell me what exactly is FUD about this,
I certainly would love to know.

So why do I actually care about this data? For one, I have the weird idea
that the goals of a free software project should be reflected in its
development process; that I actually get insulted for this is a truly
strange experience.
I would really like to have access to this data. It's data I contributed
to, it's data I care about and which I spend a lot of time working on
anyway, but unfortunately full access to it is restricted to a private
club. Consequently there will never be a gateway to easily exchange data:
if synchronization is limited to bkcvs, this will cause unnecessary
conflicts. If the information has to go via
scm->patch->bk->bkcvs->scm, some information obviously gets lost and the
scm can only guess whether the information has been merged or not. IOW an
alternative scm will always be at a disadvantage...

Anyway, I will now go away and play with the other uncool kids.

bye, Roman