History compression
From Meta, a Wikimedia project coordination wiki
Various methods for history compression
Concatenation and compression
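This has apparently been implemented (see http://www.ccc.de/congress/2004/fahrplan/event/63); details are probably documented elsewhere. The listings below compare six test files uncompressed (orig), compressed with gzip (gzip9, presumably gzip -9 judging by the directory name), and compressed with bzip2.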
./orig:
total 93M
-rw-rw-r-- 1 tstarlin wikidev 8.4k Jan 22 06:37 120.txt
-rw-rw-r-- 1 tstarlin wikidev 1.7M Jan 22 06:37 atheism.txt
-rw-rw-r-- 1 tstarlin wikidev 74M Jan 22 06:37 cleanup.txt
-rw-rw-r-- 1 tstarlin wikidev 5.1k Jan 22 06:37 Ohio_Valley_Conference.txt
-rw-rw-r-- 1 tstarlin wikidev 3.3M Jan 22 06:37 physics.txt
-rw-rw-r-- 1 tstarlin wikidev 13M Jan 22 06:37 Talk:Daniel_C._Boyer.txt

./gzip9:
total 15M
-rw-rw-r-- 1 tstarlin wikidev 737 Jan 22 06:28 120.txt.gz
-rw-rw-r-- 1 tstarlin wikidev 27k Jan 22 06:07 atheism.txt.gz
-rw-rw-r-- 1 tstarlin wikidev 11M Jan 22 06:18 cleanup.txt.gz
-rw-rw-r-- 1 tstarlin wikidev 802 Jan 22 06:26 Ohio_Valley_Conference.txt.gz
-rw-rw-r-- 1 tstarlin wikidev 50k Jan 22 06:23 physics.txt.gz
-rw-rw-r-- 1 tstarlin wikidev 2.5M Jan 22 06:32 Talk:Daniel_C._Boyer.txt.gz

./bzip2:
total 2.8M
-rw-rw-r-- 1 tstarlin wikidev 984 Jan 22 06:37 120.txt.bz2
-rw-rw-r-- 1 tstarlin wikidev 31k Jan 22 06:37 atheism.txt.bz2
-rw-rw-r-- 1 tstarlin wikidev 2.2M Jan 22 06:37 cleanup.txt.bz2
-rw-rw-r-- 1 tstarlin wikidev 984 Jan 22 06:37 Ohio_Valley_Conference.txt.bz2
-rw-rw-r-- 1 tstarlin wikidev 64k Jan 22 06:37 physics.txt.bz2
-rw-rw-r-- 1 tstarlin wikidev 480k Jan 22 06:37 Talk:Daniel_C._Boyer.txt.bz2
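The listings above compare the same files before and after compression. A minimal sketch of how such a comparison could be produced for one page is shown below. It assumes one file per revision in a single directory; the function name compare_compression and the synthetic demo data are hypothetical, and this is not necessarily the procedure used for the figures above.

```shell
#!/bin/bash
# Hypothetical helper: concatenate every file in a directory (in glob
# order) and print the raw, gzip -9 and bzip2 sizes in bytes.
compare_compression() (
    dir=$1
    raw=$(cat "$dir"/* | wc -c)
    gz=$(cat "$dir"/* | gzip -9 -c | wc -c)
    bz=$(cat "$dir"/* | bzip2 -c | wc -c)
    # $(( )) strips the padding some wc implementations add
    echo "raw=$((raw)) gzip9=$((gz)) bzip2=$((bz))"
)

# Demo on synthetic data: 50 "revisions", where revision i holds lines 1..i.
demo=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    awk -v n="$i" 'BEGIN { for (k = 1; k <= n; k++) print "Line " k " of the article" }' \
        > "$demo/$(printf '%03d' "$i")"
    i=$((i + 1))
done
result=$(compare_compression "$demo")
echo "$result"
rm -rf "$demo"
```

Glob order matches revision order only when the file names sort correctly, which is why the demo zero-pads them; files named by raw old_id values of differing lengths would need an explicit ordering.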
Timing on pliny
bunzip2
real 0m34.049s
user 0m23.610s
sys 0m1.650s

bzip2
real 1m40.663s
user 1m25.720s
sys 0m1.790s
Timing on unloaded 2.4 GHz Pentium 4
bzip2 * 98.70s user 0.97s system 57% cpu 2:52.51 total
bunzip2 * 23.28s user 1.48s system 23% cpu 1:46.69 total
gzip * 7.53s user 0.50s system 7% cpu 1:49.47 total
gunzip * 1.25s user 0.41s system 1% cpu 1:55.85 total
Rerun:
bzip2 * 99.80s user 0.70s system 96% cpu 1:43.64 total
Consecutive forward diffs
Scripts used
get_revisions
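#!/bin/bash
# Usage: get_revisions <namespace> <title>
# Writes every revision of the page to <title>/orig/<old_id>, oldest first,
# and appends the diff between consecutive revisions to <title>/diffs.
mkdir -p "$2/orig"
# -N omits the column-name header, so revision text is never filtered by grep
SQL="mysql -B -N -D enwiki -e"
FLAT_OLDIDS=`$SQL "
select old_id from old where old_namespace=$1 and old_title='$2' order by old_timestamp
"`
OLDIDS=($FLAT_OLDIDS)
echo "$FLAT_OLDIDS" > "$2/oldids"
rm -f "$2/diffs"
i=0
n=${#OLDIDS[@]}
while [ $i -lt $n ]; do
    echo $i
    # mysql -B escapes newlines as \n; awk turns them back into real newlines
    $SQL "
    select old_text from old where old_id=${OLDIDS[$i]}
    " | awk '{gsub(/\\n/,"\n");print}' > "$2/orig/${OLDIDS[$i]}"
    if [ $i -gt 0 ]; then
        diff "$2/orig/${OLDIDS[$(($i-1))]}" "$2/orig/${OLDIDS[$i]}" >> "$2/diffs"
    fi
    i=$(($i+1))
done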
dodiffs
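#!/bin/bash
# Usage: dodiffs <namespace> <title>
# Rebuilds <title>/diffs as ed scripts (diff -e) between consecutive
# revisions. The first argument is unused; presumably it is accepted so
# that the script takes the same arguments as get_revisions.
rm -f "$2/diffs"
OLDIDS=(`cat "$2/oldids"`)
i=1
n=${#OLDIDS[@]}
while [ $i -lt $n ]; do
    diff -e "$2/orig/${OLDIDS[$(($i-1))]}" "$2/orig/${OLDIDS[$i]}" >> "$2/diffs"
    i=$(($i+1))
done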
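Because dodiffs stores the diffs as ed scripts (diff -e), a later revision can in principle be rebuilt by starting from an earlier one and applying the scripts in order. The sketch below illustrates a single step on two synthetic revisions. It assumes ed is installed and that diff supports -e; the function name apply_ed_diff and the file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: apply one "diff -e" script to a copy of the
# previous revision, producing the next revision.
apply_ed_diff() (
    prev=$1 script=$2 out=$3
    cp "$prev" "$out"
    # diff -e output has no write/quit commands, so append them.
    { cat "$script"; echo w; echo q; } | ed -s "$out"
)

work=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$work/rev1"
printf 'alpha\nBETA\ngamma\ndelta\n' > "$work/rev2"

# diff exits with 1 when the files differ (expected here) and 2 on error.
diff -e "$work/rev1" "$work/rev2" > "$work/1-2.ed"
rc=$?

if [ "$rc" -le 1 ] && command -v ed >/dev/null 2>&1; then
    apply_ed_diff "$work/rev1" "$work/1-2.ed" "$work/rebuilt"
    if cmp -s "$work/rev2" "$work/rebuilt"; then
        status=identical
    else
        status=different
    fi
else
    status=skipped   # ed or diff -e is not available on this machine
fi
echo "rebuilt revision: $status"
rm -rf "$work"
```

Note that the diffs file written by dodiffs has no markers between revisions. Feeding it whole to ed on the first revision should therefore yield only the latest revision; stopping at an intermediate revision would need one diff file per revision or some separator, neither of which the scripts above write.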
Sizes
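Note that the original sizes differ from those in the concatenation test above. Because du -S excludes subdirectories, each top-level figure (for example 376K for ./Atheism) covers only the diffs and oldids files, not the orig directory.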
$ du -S -h
3.1M ./Atheism/orig
376K ./Atheism
4.1M ./Physics/orig
268K ./Physics
14M ./Daniel_C._Boyer/orig
496K ./Daniel_C._Boyer
38M ./Cleanup/orig
424K ./Cleanup
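To make these figures easier to compare, the ratio of original size to diff size can be worked out directly from the du output. The sketch below does this with awk, treating 1M as 1024K because du -h uses powers of 1024. The input table is copied by hand from the output above, and the helper name kib is hypothetical. The ratios are approximate, since du -h rounds its figures and each top-level figure also includes the small oldids file.

```shell
#!/bin/bash
# Columns: page, size of orig directory, size of top-level directory (diffs).
ratios=$(awk '
function kib(s,    n) {             # convert "3.1M" or "376K" to KiB
    n = s + 0                       # numeric prefix of the string
    return (s ~ /M$/) ? n * 1024 : n
}
{ printf "%s %.1f\n", $1, kib($2) / kib($3) }
' <<'EOF'
Atheism 3.1M 376K
Physics 4.1M 268K
Daniel_C._Boyer 14M 496K
Cleanup 38M 424K
EOF
)
echo "$ratios"
```

On these figures the ratios come out at roughly 8, 16, 29 and 92, so the pages with the longest histories appear to benefit most from storing diffs.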