User:Brion VIBBER/Dump build split


Current plan: split the dump build into four threads: one for enwiki, one for the next few largest wikis, one for a few dozen medium-to-large wikis, and a fourth for everything else.

This will spread out the timing more, make better use of the database servers, etc.

Currently going to run:

  • thread 1 (enwiki) on srv31
  • thread 2 (large) on benet
  • thread 3 (medium) on srv31
  • thread 4 (small) on benet

ZOMG

[Image: Revision counts for top 50 wikis (April 2006)]

Handy splitter tool

Attempts to break up the database list into similar-sized chunks. Not totally successful. ;)

<?php

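$total = 0;          // total revisions across all databases
$counts = array();   // per-database rows, kept in file order
$threads = 4;        // number of chunks to produce
$fudge = 1.0;        // multiplier on the per-thread target; raise it to pack threads fuller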

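foreach( file("dbsizes.csv") as $line ) {
        // Each row is "<revision count><TAB><database name>".
        $fields = explode( "\t", trim( $line ) );
        if( count( $fields ) < 2 ) continue; // skip blank or malformed rows
        list( $revs, $db ) = $fields;
        if( $db == "Database" ) continue; // skip the header row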

        //echo "$db: $revs\n";
        $counts[] = array( "db" => $db, "revs" => intval( $revs ) );
        $total += intval( $revs );
}

$perthread = intval( $total / $threads );

echo "Total: $total\n";
echo "Desired threads: $threads\n";
echo "Ideal count per thread: $perthread\n";

$assignments = array();
$dbindex = 0;
for( $i = 0; $i < $threads; $i++ ) {
        $assignments[$i] = array();
        $dbcount = 0;
        $revcount = 0;
        
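        // Fill until the target is reached; the last thread takes everything
        // left over so that no database is dropped from the list.
        while( ( $i == $threads - 1 || $revcount < $perthread * $fudge ) && $dbindex < count( $counts ) ) {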
                $revcount += $counts[$dbindex]["revs"];
                $assignments[$i][] = $counts[$dbindex];
                $dbindex++;
                $dbcount++;
        }
        
        echo "Thread $i: $dbcount databases, $revcount revisions\n";
}

foreach( $assignments as $i => $dbs ) {
        echo "\n# Thread $i\n";
        usort( $dbs, 'sortDatabases' );
        foreach( $dbs as $item ) {
                echo $item["db"] . "\n";
        }
}

function sortDatabases( $a, $b ) {
        return strcmp( $a["db"], $b["db"] );
}

?>
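
The tool above walks the list in file order and cuts a new chunk each time the running total passes the target, so one huge database can throw the whole split off. A different approach, sketched here only for comparison, is to place the largest databases first and always add the next one to whichever thread has the fewest revisions so far. The function name balanceDatabases and the sample sizes are made up for illustration; they are not part of the tool or the real data.

```php
<?php
// Sketch only: greedy "largest first, least-loaded thread" balancing.
// balanceDatabases() is a hypothetical helper, not part of the tool above.

/**
 * @param array $counts  list of array( "db" => string, "revs" => int )
 * @param int   $threads number of threads to fill
 * @return array one entry per thread: array( "revs" => int, "dbs" => string[] )
 */
function balanceDatabases( array $counts, $threads ) {
        // Largest databases first, so the big ones are placed before the small ones.
        usort( $counts, function ( $a, $b ) {
                return $b["revs"] - $a["revs"];
        } );

        $bins = array();
        for( $i = 0; $i < $threads; $i++ ) {
                $bins[$i] = array( "revs" => 0, "dbs" => array() );
        }

        foreach( $counts as $item ) {
                // Find the thread with the fewest revisions so far.
                $target = 0;
                for( $i = 1; $i < $threads; $i++ ) {
                        if( $bins[$i]["revs"] < $bins[$target]["revs"] ) {
                                $target = $i;
                        }
                }
                $bins[$target]["revs"] += $item["revs"];
                $bins[$target]["dbs"][] = $item["db"];
        }

        return $bins;
}

// Made-up sample sizes, not real revision counts.
$sample = array(
        array( "db" => "bigwiki", "revs" => 50 ),
        array( "db" => "awiki", "revs" => 20 ),
        array( "db" => "bwiki", "revs" => 15 ),
        array( "db" => "cwiki", "revs" => 10 ),
        array( "db" => "dwiki", "revs" => 5 ),
);

foreach( balanceDatabases( $sample, 3 ) as $i => $bin ) {
        echo "Thread $i: " . count( $bin["dbs"] ) . " databases, " . $bin["revs"] . " revisions\n";
}
```

Note that this balances revision counts only. It does not keep similar-sized wikis together in one thread, which the large/medium/small grouping in the plan does.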

Suggested splits from the tool

Total: 113956291
Desired threads: 4
Ideal count per thread: 28489072
Thread 0: 1 databases, 48078833 revisions
Thread 1: 4 databases, 29382691 revisions
Thread 2: 40 databases, 28577478 revisions
Thread 3: 635 databases, 7917289 revisions
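
To see how uneven this split is, each thread's revision count can be compared against the ideal count. The snippet below is just a quick check using the figures printed above; loadPercent is a hypothetical helper name.

```php
<?php
// Sketch only: express each thread's load as a percentage of the ideal count.
// loadPercent() is a hypothetical helper; the numbers are copied from the output above.

function loadPercent( $revs, $perthread ) {
        return round( 100 * $revs / $perthread );
}

$perthread = 28489072;
$loads = array( 48078833, 29382691, 28577478, 7917289 );

foreach( $loads as $i => $revs ) {
        $percent = loadPercent( $revs, $perthread );
        echo "Thread $i: $percent% of the ideal count\n";
}
```

Thread 0 (enwiki alone) comes out at roughly 169% of the ideal count and thread 3 at roughly 28%, with the two middle threads close to 100%.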

Thread 0

  • enwiki

Thread 1

  • dewiki
  • frwiki
  • nlwiki
  • plwiki

Thread 2

  • arwiki
  • bgwiki
  • bgwiktionary
  • cawiki
  • commonswiki
  • cswiki
  • dawiki
  • dewiktionary
  • enwikibooks
  • enwikinews
  • enwikiquote
  • enwiktionary
  • eowiki
  • eswiki
  • etwiki
  • fiwiki
  • frwiktionary
  • hewiki
  • hrwiki
  • huwiki
  • idwiki
  • iowiktionary
  • itwiki
  • ltwiki
  • metawiki
  • nowiki
  • plwiktionary
  • ptwiki
  • rowiki
  • ruwiki
  • sep11wiki
  • skwiki
  • slwiki
  • sourceswiki
  • srwiki
  • svwiki
  • trwiki
  • ukwiki
  • viwiki
  • zhwiki

Thread 3

  • everything else!