History compression

From Meta, a Wikimedia project coordination wiki

This page compares various methods for compressing page revision history.

Concatenation and compression

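In this approach, all revisions of a page are concatenated into a single file, which is then compressed. It has apparently been implemented (see http://www.ccc.de/congress/2004/fahrplan/event/63); the details are not documented here. The listings below show six sample page histories uncompressed, compressed with gzip -9, and compressed with bzip2.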

./orig:
total 93M
-rw-rw-r--    1 tstarlin wikidev      8.4k Jan 22 06:37 120.txt
-rw-rw-r--    1 tstarlin wikidev      1.7M Jan 22 06:37 atheism.txt
-rw-rw-r--    1 tstarlin wikidev       74M Jan 22 06:37 cleanup.txt
-rw-rw-r--    1 tstarlin wikidev      5.1k Jan 22 06:37 Ohio_Valley_Conference.txt
-rw-rw-r--    1 tstarlin wikidev      3.3M Jan 22 06:37 physics.txt
-rw-rw-r--    1 tstarlin wikidev       13M Jan 22 06:37 Talk:Daniel_C._Boyer.txt

./gzip9:
total 15M
-rw-rw-r--    1 tstarlin wikidev       737 Jan 22 06:28 120.txt.gz
-rw-rw-r--    1 tstarlin wikidev       27k Jan 22 06:07 atheism.txt.gz
-rw-rw-r--    1 tstarlin wikidev       11M Jan 22 06:18 cleanup.txt.gz
-rw-rw-r--    1 tstarlin wikidev       802 Jan 22 06:26 Ohio_Valley_Conference.txt.gz
-rw-rw-r--    1 tstarlin wikidev       50k Jan 22 06:23 physics.txt.gz
-rw-rw-r--    1 tstarlin wikidev      2.5M Jan 22 06:32 Talk:Daniel_C._Boyer.txt.gz

./bzip2:
total 2.8M
-rw-rw-r--    1 tstarlin wikidev       984 Jan 22 06:37 120.txt.bz2
-rw-rw-r--    1 tstarlin wikidev       31k Jan 22 06:37 atheism.txt.bz2
-rw-rw-r--    1 tstarlin wikidev      2.2M Jan 22 06:37 cleanup.txt.bz2
-rw-rw-r--    1 tstarlin wikidev       984 Jan 22 06:37 Ohio_Valley_Conference.txt.bz2
-rw-rw-r--    1 tstarlin wikidev       64k Jan 22 06:37 physics.txt.bz2
-rw-rw-r--    1 tstarlin wikidev      480k Jan 22 06:37 Talk:Daniel_C._Boyer.txt.bz2
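
A minimal sketch of how listings like the ones above could be produced. It assumes each page's revisions are stored one per file under `<dir>/orig/`, with the revision ids listed oldest first in `<dir>/oldids` (the layout used by get_revisions further down). The source does not say how the `.txt` files were actually built, and the function name `concat_and_compress` is hypothetical.

```shell
#!/bin/bash
# Sketch: concatenate every revision of a page in chronological order,
# then compress the result with gzip -9 and bzip2 -9.
# Assumed layout (not confirmed by the source):
#   <dir>/oldids      one revision id per line, oldest first
#   <dir>/orig/<id>   text of that revision
concat_and_compress() {
        local dir=$1 out=$2 id tool ext
        : > "$out"                               # start with an empty output file
        while read -r id; do
                cat "$dir/orig/$id" >> "$out"
        done < "$dir/oldids"
        for tool in gzip bzip2; do
                command -v "$tool" > /dev/null || continue   # skip tools that are not installed
                case $tool in gzip) ext=gz ;; bzip2) ext=bz2 ;; esac
                "$tool" -9 -c "$out" > "$out.$ext"
        done
        wc -c "$out"*                            # byte counts: original and compressed copies
}
```

For example, `concat_and_compress Atheism atheism.txt` would print the byte count of the concatenation and of each compressed copy. One likely reason bzip2 does so much better on long histories is how far back each format can look: deflate, which gzip uses, has a 32 KB window, while bzip2 works on blocks of up to 900 kB and can therefore see several near-identical revisions at once.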

Timing on pliny

bunzip2

real    0m34.049s
user    0m23.610s
sys     0m1.650s

bzip2

real    1m40.663s
user    1m25.720s
sys     0m1.790s

Timing on unloaded 2.4 GHz Pentium 4

bzip2 *  98.70s user 0.97s system 57% cpu 2:52.51 total
bunzip2 *  23.28s user 1.48s system 23% cpu 1:46.69 total
gzip *  7.53s user 0.50s system 7% cpu 1:49.47 total 
gunzip *  1.25s user 0.41s system 1% cpu 1:55.85 total

Rerun:

bzip2 *  99.80s user 0.70s system 96% cpu 1:43.64 total

Consecutive forward diffs

Scripts used

get_revisions

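#!/bin/bash
# Usage: get_revisions <namespace> <title>
# Writes every revision of the page to <title>/orig/<old_id> and appends
# the diff between consecutive revisions to <title>/diffs.
mkdir -p "$2/orig"
rm -f "$2/diffs"

# -B: tab-separated batch output; -N: omit the column-name header row
SQL="mysql -B -N -D enwiki -e"
FLAT_OLDIDS=$($SQL "
  select old_id from old where old_namespace=$1 and old_title='$2' order by old_timestamp
")
OLDIDS=($FLAT_OLDIDS)

echo "$FLAT_OLDIDS" > "$2/oldids"
i=0
n=${#OLDIDS[@]}
while [ $i -lt $n ]; do
        echo $i

        # Batch mode escapes newlines as \n; awk turns them back into real newlines
        $SQL "
          select old_text from old where old_id=${OLDIDS[$i]}
        " | awk '{gsub(/\\n/,"\n");print}' > "$2/orig/${OLDIDS[$i]}"
        if [ $i -gt 0 ]; then
                diff "$2/orig/${OLDIDS[$((i-1))]}" "$2/orig/${OLDIDS[$i]}" >> "$2/diffs"
        fi
        i=$((i+1))
done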

dodiffs

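#!/bin/bash
# Rebuilds $2/diffs from the revisions already saved by get_revisions.
# Only the second argument (the page directory) is used.
rm -f "$2/diffs"

# Revision ids, oldest first
OLDIDS=($(cat "$2/oldids"))

i=1
n=${#OLDIDS[@]}
while [ $i -lt $n ]; do
        # diff -e emits an ed script that turns the older revision into the newer one
        diff -e "$2/orig/${OLDIDS[$((i-1))]}" "$2/orig/${OLDIDS[$i]}" >> "$2/diffs"
        i=$((i+1))
done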
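
The scripts above only write the diffs; reading a revision back is not shown. A minimal sketch of one way to recover the newest revision from the oldest one plus the concatenated `diff -e` output, assuming `ed` is installed and the directory layout written by get_revisions. The function name `latest_revision` is hypothetical and this is not taken from the source.

```shell
#!/bin/bash
# Sketch: rebuild the newest revision from the oldest revision plus the
# concatenated ed scripts in <dir>/diffs. Assumes ed is available.
latest_revision() {
        local dir=$1 first
        first=$(head -n 1 "$dir/oldids")         # id of the oldest revision
        # Each diff -e script edits ed's buffer in place, so running them
        # back to back walks the text forward one revision at a time.
        # ",p" prints the final buffer; "Q" quits without saving, so the
        # stored first revision is left untouched.
        { cat "$dir/diffs"; printf '%s\n' ',p' 'Q'; } | ed -s "$dir/orig/$first"
}
```

For example, `latest_revision Atheism > current.txt`. Because the scripts are concatenated with no separators, this only reaches the final revision; recovering an intermediate one would require storing each diff separately or recording where each script ends.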

Sizes

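Note that the original sizes here differ from those in the concatenation test above. Because du -S excludes subdirectories from each directory's total, the line for each page directory counts only its diffs and oldids files, while the orig line counts the uncompressed revisions.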

$ du -S -h
3.1M    ./Atheism/orig
376K    ./Atheism
4.1M    ./Physics/orig
268K    ./Physics
14M     ./Daniel_C._Boyer/orig
496K    ./Daniel_C._Boyer
38M     ./Cleanup/orig
424K    ./Cleanup
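
Since du reports disk usage rather than exact content size, byte counts can give a finer comparison. A minimal sketch, assuming the directory layout written by get_revisions; the function name `storage_summary` is hypothetical.

```shell
#!/bin/bash
# Sketch: compare the bytes needed to store every revision in full with
# the bytes needed for the first revision plus the forward diffs.
storage_summary() {
        local dir=$1 first full delta
        first=$(head -n 1 "$dir/oldids")                         # oldest revision id
        full=$(( $(cat "$dir"/orig/* | wc -c) ))                 # all revisions, uncompressed
        delta=$(( $(cat "$dir/orig/$first" "$dir/diffs" | wc -c) ))
        echo "full=$full delta=$delta"
}
```

For example, `storage_summary Cleanup` prints both totals on one line. For very short histories the diffs can be larger than the text they replace, so the saving only appears once a page has many revisions that each change a small part of the text.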