User:Mdupont/WikiGrid

From Meta, a Wikimedia project coordination wiki

Here is my idea for the Wikipedia grid.

Statement of problem

I would like to process a section of Wikipedia on my netbook (8gb/500mb). The dump files are too large to be processed efficiently on it. The bz2 files cannot be resumed partway through, so they cannot be sliced into pieces.

update

Update: the bz2 can be sliced, using bzip2recover!

 # start downloading at byte offset 1000000
 curl -C 1000000 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 > test.bz2
 # interrupt the download (Ctrl-C) after about 1mb
 bzip2recover test.bz2
 bzcat rec00002test.bz2

it works!!!
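
bzip2recover can slice a partial download because every compressed block begins with the same 48-bit marker (0x314159265359), and that marker can sit at any bit offset, not only on a byte boundary. As a rough illustration of that scan (a sketch in Python, not the bzip2recover source), the following looks for the marker at all eight bit alignments. The demo data is generated locally at random and only stands in for a real dump slice.

```python
import bz2
import random

# 48-bit marker that opens every compressed block in a bzip2 stream.
BLOCK_MAGIC = bytes.fromhex("314159265359")


def find_block_starts(data: bytes) -> list:
    """Return the bit offsets (from the start of data) of every block marker."""
    as_int = int.from_bytes(data, "big")
    starts = []
    for shift in range(8):
        # Shifting left by `shift` bits byte-aligns every marker whose bit
        # offset is congruent to `shift` modulo 8. One pad byte stays in front.
        shifted = (as_int << shift).to_bytes(len(data) + 1, "big")
        pos = shifted.find(BLOCK_MAGIC)
        while pos != -1:
            bit_offset = (pos - 1) * 8 + shift
            if bit_offset >= 0:  # ignore matches that reach into the pad byte
                starts.append(bit_offset)
            pos = shifted.find(BLOCK_MAGIC, pos + 1)
    return sorted(starts)


# Demo input (an assumption, not dump data): compression level 1 uses the
# smallest block size, so 250000 random bytes span more than one block.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(250_000))
compressed = bz2.compress(payload, compresslevel=1)
starts = find_block_starts(compressed)
print(len(starts), "block markers, first at bit", starts[0])
```

The offsets are in bits; divide by 8 for the approximate byte position of a block inside the slice.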

possible solutions

  1. Create a p2p network of the Wikipedia dumps. Drawback: you cannot process the individual parts because bz2 cannot be resumed.
  2. Create a compressed p2p network that compresses the data on the fly. Surely bz2 could be made to resume.
  3. Use a write-once archive file system instead of bz2: http://rbytes.net/linux/cromfs-review/ Basically we would mount the file system to read it.
  4. Create multiple chunks/categories to split the data by country or by topic.

splitting

One recovered block looks like this: 265236 2009-07-23 08:34 rec00002test.bz2

When I run wget:

wget http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2
--2009-07-23 09:05:01--  http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2
Resolving download.wikimedia.org... 208.80.152.183, 2620:0:860:2:230:48ff:fe5a:eb1e
Connecting to download.wikimedia.org|208.80.152.183|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5340759527 (5.0G) [application/octet-stream]
Saving to: `enwiki-20090713-pages-articles.xml.bz2'

The size is displayed as 5340759527 bytes.

Getting the last part of the file:

 curl -C 5340494291 http://download.wikimedia.org/enwiki/20090713/enwiki-20090713-pages-articles.xml.bz2 > test.bz2
 bzip2recover test.bz2
 bzcat  rec00001test.bz2
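
The same slicing can be done from a script, because `curl -C <offset>` simply sends an HTTP `Range: bytes=<offset>-` header. Here is a minimal sketch using only the Python standard library. The helper names `range_header` and `fetch_slice` are made up for this example, and it assumes the server honours range requests (it answers 206 Partial Content).

```python
import urllib.request


def range_header(start: int, end=None) -> str:
    """Build a Range header value; `end` is inclusive, None means 'to the end'."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    return f"bytes={start}-" if end is None else f"bytes={start}-{end}"


def fetch_slice(url: str, start: int, end=None) -> bytes:
    """Download only the requested bytes of `url`."""
    request = urllib.request.Request(url, headers={"Range": range_header(start, end)})
    with urllib.request.urlopen(request) as response:
        if response.status != 206:
            # A plain 200 means the server ignored the Range header.
            raise RuntimeError(f"no partial content (status {response.status})")
        return response.read()


TOTAL_SIZE = 5340759527  # Length reported by wget
TAIL_LENGTH = 265236     # size of the recovered block listed earlier
tail_start = TOTAL_SIZE - TAIL_LENGTH
print(range_header(tail_start))
```

Subtracting 265236 from the total size gives 5340494291, the offset used in the curl command. The slice can then be written to a file and passed to bzip2recover as before.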


The input blocks are a fixed size, but the compressed output blocks are not. Here are the blocks that bzip2recover found in the last part of the file, with their sizes:

  block   runs from   to         size
  1       939312      2475024    1535712
  2       2475073     4223261    1748188
  3       4223310     5728877    1505567
  4       5728926     7127137    1398211
  5       7127186     8504183    1376997
  6       8504232     9849227    1344995
  7       9849276     11256328   1407052

The average size of a compressed block is 1473817 bits (bzip2recover reports positions in bits), or about 184 kB. We can split the file into chunks for processing.


So here is a test: if the total size is 5340759527 bytes, that gives around 28990 blocks (5340759527 × 8 / 1473817), and the middle of the file is around byte offset 2670379763.5.
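
The arithmetic can be checked with a few lines of Python. This is a sketch that assumes the positions printed by bzip2recover are bit offsets (consecutive blocks in the listing are 49 apart, which fits a 48-bit block marker), so it converts to bits before comparing with the file size. Dividing the byte count by the raw average without that conversion gives 3623.76.

```python
# (from, to) positions of the seven blocks listed above, as printed by bzip2recover.
BLOCKS = [
    (939312, 2475024),
    (2475073, 4223261),
    (4223310, 5728877),
    (5728926, 7127137),
    (7127186, 8504183),
    (8504232, 9849227),
    (9849276, 11256328),
]
TOTAL_BYTES = 5340759527  # Length reported by wget

sizes = [end - start for start, end in BLOCKS]
average_bits = sum(sizes) / len(sizes)
average_bytes = average_bits / 8

# Assumption: block positions are in bits, the file length is in bytes.
estimated_blocks = TOTAL_BYTES * 8 / average_bits
middle_offset = TOTAL_BYTES / 2

print(f"average block: {average_bits:.0f} bits = {average_bytes:.0f} bytes")
print(f"estimated blocks: {estimated_blocks:.0f}")
print(f"middle byte offset: {middle_offset}")
```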

 bzip2recover middle.bz2
 bzcat rec00001middle.bz2 | grep -e '<page' -e '<title'

 <page>
   <title>Microsoft money</title>
 <page>
   <title>Ms money</title>
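
The same check can be done in Python. A recovered block is only a fragment of the dump, so its XML is not well formed, and a line-based regular expression is a safer fit than an XML parser. A minimal sketch follows; `titles_in_block` is a name chosen for this example.

```python
import bz2
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def titles_in_block(source):
    """Yield the page titles found in one recovered .bz2 block.

    `source` is a file name or an open binary file object. Decoding errors are
    replaced because a block may start or end in the middle of a character.
    """
    with bz2.open(source, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = TITLE_RE.search(line)
            if match:
                yield match.group(1)
```

For example, `list(titles_in_block("rec00001middle.bz2"))` would return the titles in the block recovered above.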

Now to make a Perl script to handle this.

Here is the first version: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/revision/5

Run it like this: perl GetPart.pl 1 [dump url]

I added an overlap function to get a range around the target block. It is not really tested and may be buggy!

requirements

  1. Be able to download slices of the data and use them.
  2. If the data is shared over p2p, we should be able to process the chunks we already have without waiting for the whole thing.


patches

Changed the overlap and block size. Now one full block overlaps between adjacent sections.
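
To illustrate the idea (this is not the code in GetPart.pl, just a sketch of one way to do it), each chunk's byte range can be widened on both sides so that neighbouring chunks share a region. The function name, the 0-based part numbering, and the chunk and overlap sizes below are all assumptions made for the example.

```python
def chunk_range(part, chunk_size, overlap, total_size):
    """Inclusive byte range of chunk `part` (0-based), widened by `overlap` bytes."""
    if part < 0 or part * chunk_size >= total_size:
        raise ValueError("part is outside the file")
    start = max(0, part * chunk_size - overlap)
    end = min(total_size, (part + 1) * chunk_size + overlap) - 1
    return start, end


TOTAL_SIZE = 5340759527        # Length reported by wget
CHUNK_SIZE = 16 * 1024 * 1024  # hypothetical chunk size
OVERLAP = 512 * 1024           # hypothetical, larger than any block measured so far

first = chunk_range(0, CHUNK_SIZE, OVERLAP, TOTAL_SIZE)
second = chunk_range(1, CHUNK_SIZE, OVERLAP, TOTAL_SIZE)
print(first, second)
```

Two neighbouring chunks then share 2 × OVERLAP bytes. As long as no compressed block is larger than OVERLAP, that shared region has room for at least one complete block.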


Check it like this: md5sum *_.bz2 | sort

Here is an overlap:

372fcdcd0eedcd16214eabed0630a102  rec00001wikipedia_dump_part_0301_.bz2
372fcdcd0eedcd16214eabed0630a102  rec00023wikipedia_dump_part_0300_.bz2
bef894978d27d09bc00f5912b6e3e72c  rec00001wikipedia_dump_part_0003_.bz2
bef894978d27d09bc00f5912b6e3e72c  rec00020wikipedia_dump_part_0002_.bz2

source

This new version has range functionality. http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/revision/5


checker

You will see that each chunk has one block in common with its predecessor and one with its successor. This lets anyone download a single chunk and still have at least one full article. The first block of a chunk is only needed for articles that span across into the second block.

md5sum *_.bz2 | sort | perl checksort.pl

637a7351f0b40003e3990b7a0f80d815  rec00001wikipedia_dump_part_0005_.bz2
637a7351f0b40003e3990b7a0f80d815  rec00020wikipedia_dump_part_0004_.bz2

7e4c51494320299723a5cfbae54743c9  rec00002wikipedia_dump_part_0004_.bz2
7e4c51494320299723a5cfbae54743c9  rec00021wikipedia_dump_part_0003_.bz2

8923c457ac312f07b8b7ddf3bd2ae73a  rec00002wikipedia_dump_part_0005_.bz2
8923c457ac312f07b8b7ddf3bd2ae73a  rec00021wikipedia_dump_part_0004_.bz2

fd2d8b84e2456d3ec0a8eab111a35e47  rec00001wikipedia_dump_part_0004_.bz2
fd2d8b84e2456d3ec0a8eab111a35e47  rec00020wikipedia_dump_part_0003_.bz2

You can see that block 20 of chunk 3 is the same as block 01 of chunk 4.
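
The comparison can also be scripted. The sketch below is not checksort.pl (its source is not shown here). It is a small Python stand-in that groups md5sum output by digest and reports which (chunk, block) pairs are identical. It assumes the file naming seen above: `rec` plus the block number, then `wikipedia_dump_part_` plus the chunk number.

```python
import re
from collections import defaultdict

# Assumed naming: rec<block>wikipedia_dump_part_<chunk>_.bz2
NAME_RE = re.compile(r"rec(\d+)wikipedia_dump_part_(\d+)_\.bz2$")


def shared_blocks(md5sum_lines):
    """Map each digest seen in more than one file to its sorted (chunk, block) pairs."""
    by_digest = defaultdict(list)
    for line in md5sum_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        digest, name = fields
        match = NAME_RE.search(name)
        if match:
            block, chunk = int(match.group(1)), int(match.group(2))
            by_digest[digest].append((chunk, block))
    return {d: sorted(v) for d, v in by_digest.items() if len(v) > 1}


SAMPLE = """\
7e4c51494320299723a5cfbae54743c9  rec00002wikipedia_dump_part_0004_.bz2
7e4c51494320299723a5cfbae54743c9  rec00021wikipedia_dump_part_0003_.bz2
fd2d8b84e2456d3ec0a8eab111a35e47  rec00001wikipedia_dump_part_0004_.bz2
fd2d8b84e2456d3ec0a8eab111a35e47  rec00020wikipedia_dump_part_0003_.bz2
"""

for digest, pairs in shared_blocks(SAMPLE.splitlines()).items():
    print(digest, pairs)
```

On real data the input would be the output of `md5sum *_.bz2`, read line by line.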