User:Mdupont/WikiGrid
Here is my idea for the Wikipedia grid.
Statement of problem
I would like to process a section of the Wikipedia dump on my netbook (8 GB / 500 MB).
The dump files are too large to be processed efficiently on such a machine.
The bz2 files cannot be decompressed starting from an arbitrary offset, so simply slicing them into bits does not work.
update
Update: the bz2 file can be sliced after all, using bzip2recover!
curl -C1000000 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 > test.bz2    (break the download after about 1 MB)
bzip2recover test.bz2
bzcat rec00002test.bz2
It works!!!
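The same trick should work at any byte offset, not just at 1000000. A minimal sketch (the offset 50000000, the 2 MB slice size and the name slice.bz2 are just example values; any slice large enough to hold a few complete compressed blocks will do):

# grab about 2 MB starting at an arbitrary byte offset, without downloading the rest of the file
curl -C50000000 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 | head -c 2000000 > slice.bz2
# rebuild the complete bzip2 blocks found inside the slice
bzip2recover slice.bz2
# blocks next to the cut points may be incomplete, so decompress one from the middle of the slice
bzcat rec00002slice.bz2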
possible solutions
- Create a P2P network of the Wikipedia dumps. Drawback: you cannot process the individual parts, because bz2 cannot be resumed mid-stream.
- Create a compressed P2P network that compresses the data on the fly. Surely bz2 could be made resumable.
- Use a write-once archive file system such as cromfs instead of bz2 (http://rbytes.net/linux/cromfs-review/); basically we would mount the filesystem to read it.
- Create multiple chunks/categories, splitting the data by country or by topic.
splitting
One recovered block looks like this:
265236 2009-07-23 08:34 rec00002test.bz2
When I do wget:
wget http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2
--2009-07-23 09:05:01-- http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2
Resolving download.wikimedia.org... 208.80.152.183, 2620:0:860:2:230:48ff:fe5a:eb1e
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5340759527 (5.0G) [application/octet-stream]
Saving to: `enwiki-20090713-pages-articles.xml.bz2'
The size is displayed as 5340759527 bytes.
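The same number can also be read from the HTTP headers without starting a download at all, e.g. with a HEAD request (this should print the Content-Length of 5340759527):

curl -sI http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 | grep -i content-length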
Getting the last part of the file:
curl -C5340494291 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 > test.bz2
bzip2recover test.bz2
bzcat rec00001test.bz2
The input (uncompressed) blocks are a fixed size (900 kB by default for bzip2), but the compressed output blocks are not.
Here is the last bit of the file, listed as block offsets and sizes (bzip2recover reports bit positions):
block 1 runs from  939312 to  2475024  (size 1535712)
block 2 runs from 2475073 to  4223261  (size 1748188)
block 3 runs from 4223310 to  5728877  (size 1505567)
block 4 runs from 5728926 to  7127137  (size 1398211)
block 5 runs from 7127186 to  8504183  (size 1376997)
block 6 runs from 8504232 to  9849227  (size 1344995)
block 7 runs from 9849276 to 11256328  (size 1407052)
The average compressed block is about 1473817 bits, i.e. roughly 180 kB, so we can split the file into chunks of several blocks each for processing.
So here is a test: the total size is 5340759527 bytes, i.e. about 42.7 billion bits, which would give us around 29000 compressed blocks. The middle of the file would be around byte 2670379763.
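The middle.bz2 used below is presumably fetched the same way as the earlier slices, something like this, breaking the transfer after a few MB:

curl -C2670379763 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 > middle.bz2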
bzip2recover middle.bz2
bzcat rec00001middle.bz2 | grep -e\<page -e\<title
<page>
<title>Microsoft money</title>
<page>
<title>Ms money</title>
Now to make a Perl script to handle this.
Here is the first version: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/revision/5
Run it like this: perl GetPart.pl 1 [dump url]. I added an overlap function to get a range around the target block; it is not really tested and can be buggy!
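The script itself is not reproduced here; as a rough shell sketch of the same idea (not the actual GetPart.pl: the 4 MB chunk size and 400 kB overlap are assumed values, and the part naming simply mirrors the file names in the checksums below), fetching one chunk could look like this:

N=3                            # which chunk to fetch
CHUNK=4000000                  # assumed chunk size in bytes
OVERLAP=400000                 # assumed extra bytes, enough for at least one full compressed block
URL=http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2
PART=$(printf 'wikipedia_dump_part_%04d_.bz2' $N)
# fetch chunk N plus a little extra, so it shares whole blocks with chunk N+1
curl -C$(( N * CHUNK )) "$URL" | head -c $(( CHUNK + OVERLAP )) > "$PART"
# split the chunk into its individual blocks: rec00001wikipedia_dump_part_0003_.bz2, rec00002..., one file per block
bzip2recover "$PART"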
requirements
- Be able to download slices of the data and use them.
- If sharing the data over P2P, we should be able to process the chunks as we get them, without having to wait for the whole file.
patches
Changed the overlap and block size; now one full block overlaps on each section.
Check it like this:
md5sum *_.bz2 | sort
Here is an overlap:
372fcdcd0eedcd16214eabed0630a102 rec00001wikipedia_dump_part_0301_.bz2
372fcdcd0eedcd16214eabed0630a102 rec00023wikipedia_dump_part_0300_.bz2
bef894978d27d09bc00f5912b6e3e72c rec00001wikipedia_dump_part_0003_.bz2
bef894978d27d09bc00f5912b6e3e72c rec00020wikipedia_dump_part_0002_.bz2
source
This new version has range functionality: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/revision/5
checker
You will see that each chunk has one block in common with its predecessor and successor. This way anyone can download a single chunk and still have the articles in it complete. When processing a chunk, the first (shared) block is only used for articles that span across into the second block; articles contained entirely in the first block are handled by the previous chunk.
md5sum *_.bz2 | sort | perl checksort.pl
637a7351f0b40003e3990b7a0f80d815 rec00001wikipedia_dump_part_0005_.bz2
637a7351f0b40003e3990b7a0f80d815 rec00020wikipedia_dump_part_0004_.bz2
7e4c51494320299723a5cfbae54743c9 rec00002wikipedia_dump_part_0004_.bz2
7e4c51494320299723a5cfbae54743c9 rec00021wikipedia_dump_part_0003_.bz2
8923c457ac312f07b8b7ddf3bd2ae73a rec00002wikipedia_dump_part_0005_.bz2
8923c457ac312f07b8b7ddf3bd2ae73a rec00021wikipedia_dump_part_0004_.bz2
fd2d8b84e2456d3ec0a8eab111a35e47 rec00001wikipedia_dump_part_0004_.bz2
fd2d8b84e2456d3ec0a8eab111a35e47 rec00020wikipedia_dump_part_0003_.bz2
You can see that block 20 of chunk 3 is the same as block 01 of chunk 4.
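checksort.pl is not shown here, but the same duplicate check can be sketched with GNU coreutils alone, by grouping lines whose first 32 characters (the md5 hash) repeat:

# print every md5 that occurs more than once, i.e. blocks shared between neighbouring chunks
md5sum *_.bz2 | sort | uniq -w32 --all-repeated=separate

Each printed group should pair an early block of one chunk with a late block of the previous chunk, just like the checksort output above; a hash that shows up in non-adjacent chunks would point to a broken download.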