Re: [rfc] git: combo-blobs

From: Ingo Molnar
Date: Mon Apr 11 2005 - 10:13:26 EST



* Paul Jackson <pj@xxxxxxxxxxxx> wrote:

> Hmmm ... I have this strong sense that I am about 2 hours away from
> smacking my forehead and groaning "Duh - so that's what Ingo meant!"
>
> However, one must play out one's destiny.
>
> Could you provide an example scenario, which results in the creation
> of a combo-blob?
>
> The best I can come up with is the following.
>
> Let's say Nick changes one line in the middle of kernel/sched.c (yeah
> - I know - unlikely scenario - he usually changes more than that -
> never mind that detail.)
>
> In the days Before Combo Blobs (BCB), git would have been told that
> kernel/sched.c was to be picked up, and would have wrapped it up in a
> zlib'd blob, sha1summed it, seen it was a new sum, and added that blob
> to its objects (or something like this -- I'm still a little fuzzy on
> these git details.)
>
> But Nick just downloaded the latest git 1.5.11.1 which has added
> support for combo blobs, so now, guessing here, instead of wrapping up
> the new sched.c, git instead unwraps the old one, diffs it with the new,
> notices a couple of long sequences that are unchanged, wraps up both
> of those sequences as a couple of relatively large blobs, and wraps up
> the new lines that Nick just coded in the middle as a small blob, and
> puts all three in the object store, along with another small
> combo-blob, tying them all together.

actually, git would just include the previous blob by reference.

let's say we had the previous version of sched.c in a blob, ID
cc4ee6107d19f89898a8c89d45810f01710f2ff4. We have the new edit (which is
small, let's say 20 bytes) in blob e010fab710092b19be6e26de1721e249dff2d141.
We'd create the combo-blob representing the new version of sched.c the
following way:

include cc4ee6107d19f89898a8c89d45810f01710f2ff4 0 54010
include e010fab710092b19be6e26de1721e249dff2d141 0 20
include cc4ee6107d19f89898a8c89d45810f01710f2ff4 54030 73061

so we'd include (by reference) most of the previous version, with a
small blob for the extras. Since sched.c compresses down to 36K, we
saved ~32K of bandwidth, and somewhere on the order of 20K of storage.
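
a minimal sketch of how such a combo-blob could be expanded, assuming each
line reads "include <blob ID> <offset> <length>" (that field meaning is
inferred from the example, it is not spelled out anywhere). The names
parse_combo, assemble and store are hypothetical, and the blob IDs "old"
and "edit" are toy stand-ins for real sha1 IDs:

```python
def parse_combo(text):
    """Parse 'include <id> <offset> <length>' lines into tuples.

    The (offset, length) reading of the two numbers is an assumption.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, blob_id, offset, length = line.split()
        if keyword != "include":
            raise ValueError(f"unexpected directive: {keyword}")
        entries.append((blob_id, int(offset), int(length)))
    return entries


def assemble(entries, store):
    """Concatenate the referenced byte ranges in order."""
    out = bytearray()
    for blob_id, offset, length in entries:
        data = store[blob_id]  # already-unpacked contents of that blob
        if offset < 0 or length < 0 or offset + length > len(data):
            raise ValueError(f"range outside blob {blob_id}")
        out += data[offset:offset + length]
    return bytes(out)


# Toy object store: the old file and a same-sized replacement for its middle.
head = b"int a;\n"
old_mid = b"int x = 1;\n"
tail = b"int b;\n"
new_mid = b"int x = 2;\n"
store = {"old": head + old_mid + tail, "edit": new_mid}

# Same shape as the three-line example: old prefix, new bytes, old suffix.
combo = "\n".join([
    f"include old 0 {len(head)}",
    f"include edit 0 {len(new_mid)}",
    f"include old {len(head) + len(old_mid)} {len(tail)}",
])

new_file = assemble(parse_combo(combo), store)
print(new_file.decode(), end="")
```

the third line skips over the replaced bytes of the old blob, the same way
the example jumps from 54010 to 54030.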

to reconstruct the file from the combo-blob later on, we do have to unpack
the previous sched.c blob (and if that is itself a combo-blob that is not
cached then we'd have to unpack all parents until we arrive at some full
blob).
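
the parent walk could look roughly like this. It is only a sketch: the
store layout (a "full" or "combo" tag per object), the resolve function
and the toy IDs v1/v2/v3/d2/d3 are all made up for illustration, and the
two numbers per entry are again assumed to be offset and length:

```python
def resolve(blob_id, store, cache):
    """Return the full contents of blob_id, unpacking parents as needed."""
    if blob_id in cache:
        return cache[blob_id]  # cached: no need to walk the parents again
    kind, payload = store[blob_id]
    if kind == "full":
        data = payload
    else:
        # A combo-blob: resolve every referenced blob, then slice and join.
        data = b"".join(
            resolve(parent, store, cache)[offset:offset + length]
            for parent, offset, length in payload
        )
    cache[blob_id] = data
    return data


store = {
    "v1": ("full", b"hello world\n"),
    "d2": ("full", b"there"),
    "d3": ("full", b"!"),
    # v2 replaces "world" in v1; v3 inserts "!" before v2's newline.
    "v2": ("combo", [("v1", 0, 6), ("d2", 0, 5), ("v1", 11, 1)]),
    "v3": ("combo", [("v2", 0, 11), ("d3", 0, 1), ("v2", 11, 1)]),
}

cache = {}
print(resolve("v3", store, cache).decode(), end="")
```

resolving v3 has to go through v2 and then v1, which is the "unpack all
parents" cost; with the cache filled, a later lookup of v2 is free.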

> So far, not too bad. Haven't gained anything, and required the
> unpacking of a zlib blob we didn't require before, and the running and
> analyzing of a diff we didn't require before, but the end result is
> only moderately worse - four object blobs instead of one, but of total
> size not much larger (well, total size typically 3 disk blocks worse,
> due to a slight increase in fragmentation from using 4 blocks to store
> what used to be in one.)

we'd have 2 new objects (the 'delta' and the 'combo' blob).

(if the # of objects is an issue then we could include new data in the
combo-blob itself too, but that's getting too complex I think.)

Ingo