[SCM saga] An alternative to git

From: Eli
Date: Wed Apr 13 2005 - 08:33:40 EST


[I sent this from my gmail account last night, but it did not appear to make
it to the list, so I'm re-sending. Sorry if you get this twice.]

Linus,

Background:
-----------
The SCM problem is something I've been toying with for a while now on my own
because I wanted a distributed SCM tool, but didn't like one thing or another
about the existing projects. I thought monotone came very close to what I
wanted, but I didn't like the repository as a binary blob. So I've been
playing with my own prototype SCM project. I'd like to know what you think
of this approach, and maybe observations on this prototype implementation.

The Repository:
---------------
The on-disk format looks like this:
linux-repository

linux_kernel_repo/
|-- cache
|-- nodes ("revisions" would be a better name)
|-- patches
`-- tags

Files are created in the repository, and never modified once they are there.
Files are never deleted from the repository (with the possible exception of
stuff in the cache directory).

The patches directory contains <hashofself>.patch files, which are basically
[comments]
[diffstat]
[diff -aurN]

The nodes directory contains <hashofself>.node files, which have
[author] [date]
[<hash>.node <hash>.patch]
... (There can be 0 or more [node,patch] tuples.)

The tags directory contains files named like this:
1113315703.63-eli@xxxxxxxxxxxxxxxxxxxxxxx
(i.e: <epoch time>-<author>-<tag-name>.tag)
that contain the name of a node: <hash>.node

Then finally, there is the cache directory. With my current implementation,
this area is used to cache unpacked revisions of the source tree for diffing
and the like. The contents of these directories are hard-linked to conserve
some space.

Repository Implications:
------------------------
This repository layout allows for a number of neat features.
Like git, you can use rsync to push and pull everything from one repository to
another. (You'll probably want to exclude (most of) the cache directory.)
But even better than that, you can pull a cached node from a repository into a
new repository, (such as linux-2.6.11) then pull your current head revision
and patches tracking back to that node, giving you a minimal repository with
just the history you want. Then when you're happy with what you have, push
it out.
That would let you have every revision of Linux known to man in a public
repository somewhere (kernel.org?), but you could keep a much reduced copy of
the repository on your working machine.
And we can do all of that without writing a dedicated SCM server: ftp, sftp,
rsync, etc. should all work.

Since the patches don't reference a node in themselves, they can be applied to
nodes other than the one they were created from. I believe this will allow
merging to work better than git would allow because we can track that the
same patch was applied to both of two lines of development that we're trying
to merge.

Current performance isn't very good, but I haven't really optimized it yet,
either. The numbers below are from commits to my repository on a 2.6GHz
Celeron w/768MB RAM.

Commit 2.6.0 95.31user 354.33system 9:07.91elapsed 82%CPU
Commit 2.6.1 61.53user 139.10system 6:01.20elapsed 55%CPU
Commit 2.6.2 21.48user 32.11system 1:02.36elapsed 85%CPU
Commit 2.6.3 22.40user 31.51system 1:00.45elapsed 89%CPU
Commit 2.6.4 26.41user 42.79system 1:17.18elapsed 89%CPU
Commit 2.6.5 24.16user 38.97system 1:13.21elapsed 86%CPU
Commit 2.6.6 25.37user 38.99system 1:15.28elapsed 85%CPU
Commit 2.6.7 35.83user 55.06system 1:46.76elapsed 85%CPU
Commit 2.6.8 43.94user 70.02system 2:27.96elapsed 77%CPU
Commit 2.6.9 43.90user 75.02system 2:42.44elapsed 73%CPU
Commit 2.6.10 52.16user 89.46system 3:03.61elapsed 77%CPU
Commit 2.6.11 54.47user 93.21system 3:25.57elapsed 71%CPU
Commit 2.6.11.1 14.06user 29.82system 1:04.60elapsed 67%CPU
Commit 2.6.11.2 5.42user 12.66system 0:32.30elapsed 56%CPU
Commit 2.6.11.3 1.91user 1.35system 0:06.49elapsed 50%CPU
Commit 2.6.11.4 1.82user 1.41system 0:06.84elapsed 47%CPU
Commit 2.6.11.5 1.91user 1.31system 0:04.26elapsed 75%CPU
Commit 2.6.11.6 1.89user 1.35system 0:06.94elapsed 46%CPU
Commit 2.6.11.7 2.07user 1.50system 0:07.44elapsed 48%CPU

Doing some profiling leads me to believe that there is a lot of room for
improvement for the nothing -> 2.6.0 and the 2.6.0 -> 2.6.1 commits. I
believe the 2.6.x commits can also be improved.
The 2.6.11.x commits tend to be <10 seconds. Which should be your common
case.
(Yes, I know git is blazingly fast, but are these numbers reasonable? How far
would they need to go to be reasonable? I think these are comparable to what
monotone would do, but they've done some optimization recently, so I'm
probably slower.)

As for disk usage (du -sck) of the above repository:
152 tags
168 nodes
354572 patches
Those three directories contain all the history.
1379704 cache
And that can be re-created from the nodes+patches.
There is no compression going on here, which would be an easy way to improve
that. I haven't implemented that yet though.

The Sandbox:
------------
At the top directory of the sandbox, I have a .scm directory containing state
information such as current revision and repository location. The rest is
just source code. Pretty straight-forward.

Efficiency/Optimizations:
-------------------------
I used some of the ideas from the git discussion in this implementation.
The sandbox keeps some cached information in the .scm; the file stat
information, and a current hash of each file. That way it can look at only
the files it needs to for committing, diffing, etc.
(Sadly, this may not have helped as much as I expected. I have a version of
my tool that just does a straight recursive diff on the whole thing, and the
times are comparable. That may just point to a really dumb bug on my part
though.)
I do a lot of os.system()'s and os.popen()'s that could probably be redone.
Doing a quick grep, it looks like I use: diff, sed, cp, filterdiff, diffstat,
grep, tail, patch, find, xargs, stat, sort, sha1sum, rsync and $EDITOR. (In
my defense, I was aiming for speed of implementation--after all, git is
progressing quite quickly.)
It might make sense to glue this together with git -- use your "content
addressable file system" for the cache directory to get the speed, but use
the the nodes and patches to record the history.

Implementation:
---------------
Attached is scm.py (pronounced "skimpy"), which as the name suggests, is a
python implementation of the SCM system I've described. I'm using Python
2.3.4 on Fedora Core 3, and I haven't tested this on other
distributions/platforms/etc. (This is all original code I've done on my own
time and equipment, and I'm releasing it under the GPL-v2-or-later, with
absolutely no warranty, and the warning that it may kill your dog, wreck your
car, etc.)

There are a number of known issues:
The program is a little user-hostile--when it runs into something unexpected,
it's liable to raise an exception, print a back-trace, and exit. Hopefully
it'll give a useful hint in the exception. It may even leave something
in /tmp, though if it does, that's a bug I'd like to hear about.
Sometimes you'll get prompted by patch when scm calls out to it when it really
should be set up to run non-interactively.
I have not dealt with file renames yet, but I expect that to become part of
the patch "comment" area, much like the diffstat is. I have not decided how
to best represent that.
I have not implemented merging yet, but the groundwork is there. I expect to
do a breadth-first search for a common ancestor, find the patches in the line
of development you are merging into your own that you do not already have in
your own, apply them to your line of development creating nodes along the
way, then let you deal with the broken parts. As a first pass anyway.
I haven't directly addressed file permissions or directories yet. I'm
expecting those would wind up being handled similarly to file renames in the
patch files.
If we want to add arbitrary attributes for patches and/or nodes, we could
create a "attr" directory containing a directory named after the node or
patch, and in that, a file like <epoch>-<author>-<attribute-name> containing
the value of that attribute. For that matter, we could re-do tags that way.

Quick Start:
------------
./scm.py help
./scm.py init <path-to-new-repository>
./scm.py --repo=<repository> checkout mysandbox
cd mysandbox
<copy your source in>
./scm.py diff
./scm.py diffstat
./scm.py commit
And this one is fun:
./scm.py --repo=<repository> graph | dot -Tps repo-history.ps

Main Points:
------------
I believe that this representation of the kernel history is better than git's
because I believe it will improve our ability to do merges and more closely
represents what is occurring.
I believe that the repository design will give us many of the same benefits as
git, and some that git does not give us.
I believe that the implementation can be significantly improved.
The performance scales primarily by the size of the change, rather than the
size of the code base or the size of the repository.


Ok, so that's scm.py. Now, let me lock my ego in this nicely padded safe, don
the asbestos coveralls, and cower behind this nice sturdy boulder....

Alright, pull out the flame throwers and have at it! :)

Eli
-----------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter@xxxxxxxxxxxxx `--------------------------------------------

Attachment: scm.py
Description: application/python