Reverse diff version control

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search

Some (most?) version control systems use a reverse diff storage system to keep storage requirements down to a minimum, at the expense of retrieval time for old versions.

The idea is pretty simple: you always keep a full copy of the current version of a file -- let's call it version N. But instead of keeping a full copy of version N - 1, you just keep the difference between that copy and version N. Similarly, for version N - 2, you keep the difference between N - 2 and N - 1.

If differences between versions are significantly smaller than the versions themselves, there's a great reduction in storage requirements for the entire system.

To generate a full copy of version N - 2, you have to first retrieve version N, then the diff between version N - 1 and version N, and apply that patch to make version N - 1, then retrieve the diff between version N - 2 and version N - 1, and apply that patch to make version N - 2. The idea, though, is that you retrieve old versions much, much less frequently than the current version.

You could conceivably store all old versions as diffs against the current version -- making retrieval time faster -- but it makes storing slower, since you have to recompute the diffs each time you save.

Another optimization is keeping an LRU cache of old versions that have been retrieved.

MediaWiki 1.2 does compression on old versions, but diffs would probably take even less space. Even if compression reduces the size of old versions to, say, 10% of their original size, storing a diff that's just 1% of the old version is even more of an improvement. If storage is at a premium, you could even compress the diffs... although this probably wouldn't pay off for very small diffs.