Data dumps/Import examples

Examples of the import process

Import into an empty wiki of el wiktionary on Linux with MySQL

MediaWiki version: 1.20

This wiki was chosen because it uses a non-latin1 character set, has a reasonable number of articles but isn't huge, and relies on only a small number of extensions.

I chose to import only the current pages, without User or Talk pages, because most folks who set up local mirrors want the article content and not the revision history or the discussion pages.

Before the import

  1. I downloaded the dumps for a given day. I got all the sql.gz files, the stub-articles.xml.gz file, and the pages-articles.xml.bz2 file from http://download.wikimedia.org/elwiktionary/ even though I knew there would be a few of those sql files I wouldn't need.
  2. I installed the prerequisites for MediaWiki, including MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick and rsvg (see the manual).
  3. I downloaded MediaWiki 1.20 and unpacked it into /var/www/html/elwikt (your location may vary).
  4. I installed MediaWiki 1.20 on my laptop, with the following settings:
    • el for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • elwikt for database name
    • no database table prefix
    • the root db username and password as the database credentials for the install
    • a different user name and password for the database account for web access, with 'create if it does not exist' checked
    • InnoDB table format
    • Binary character set
    • Disable media uploads
    • use InstantCommons
  5. I selected the extensions I wanted installed via the installer, some of them not strictly necessary, but ones I thought would be useful to have if I decided to edit locally:
    • ConfirmEdit
    • Gadgets
    • Nuke
    • ParserFunctions
    • RenameUser
    • Vector
    • WikiEditor
  6. I generated page, revision and text sql files from the stub and page content XML files, using mwxml2sql via the command mwxml2sql -s elwiktionary-blahblah-stub-articles.xml.gz -t elwiktionary-blahblah-pages-articles.xml.bz2 -f elwikt-pages-current-sql.gz -m 1.20
  7. I converted all the sql files to tab delimited files using sql2txt via the command zcat elwiktionary-blahdate-blahtable.sql.gz | sql2txt | gzip > elwiktionary-blahdate-blahtable.tabs.gz. Actually that's a lie: I wrote a tiny bash script to do them all for me (sketched after this list). I skipped the following downloaded files:
    • site_stats - I didn't want or need these, the numbers would be wrong anyways
    • user_groups - Not needed for displaying page content
    • old_image and image - using InstantCommons
    • page - generated from XML files instead
  8. I converted the page, revision and text table files generated from the XML files to tab-delimited format, using a command similar to the one in the previous step.
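
For the record, the tiny script mentioned in step 7 looked roughly like the following (a sketch rather than the exact script; it assumes sql2txt is on your PATH, that the dump files sit in the current directory, and that you substitute your own dump date for the placeholder):

    #!/bin/bash
    # Convert each downloaded .sql.gz dump to a tab-delimited .tabs.gz file,
    # skipping the tables listed above that we don't want to import.
    WIKI=elwiktionary
    DATE=blahdate
    for f in ${WIKI}-${DATE}-*.sql.gz; do
        table=${f#${WIKI}-${DATE}-}
        table=${table%.sql.gz}
        case $table in
            site_stats|user_groups|old_image|image|page) continue ;;
        esac
        zcat "$f" | sql2txt | gzip > "${WIKI}-${DATE}-${table}.tabs.gz"
    done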

The actual import

Note: maybe using charset 'binary' here would be better!

  1. I imported all of the above files into MySQL, doing the following:
    • mysql -u root -p
    • mysql>use elwikt
    • mysql>SET autocommit=0;
    • mysql>SET foreign_key_checks=0;
    • mysql>SET unique_checks=0;
    • mysql>SET character_set_client = utf8;
    • unpacked the tab delimited file
    • mysql>TRUNCATE TABLE tablenamehere;
    • mysql>LOAD DATA INFILE 'path-to-tab-delim-file-for-table-here' INTO TABLE tablenamehere FIELDS OPTIONALLY ENCLOSED BY '\'';
    • repeated this for all tab delim files (a scripted version of this loop is sketched below)
    • mysql>exit;
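
The whole loop can be scripted; here is a rough sketch (not the exact commands I ran). It assumes one unpacked tab-delimited file per table, named elwiktionary-blahdate-<table>.tabs, in a directory the MySQL server is allowed to read from (FILE privilege / secure_file_priv), and it prompts for the root password once per table:

    #!/bin/bash
    DIR=/var/tmp/elwikt
    for f in "$DIR"/elwiktionary-blahdate-*.tabs; do
        table=${f##*-}          # strip everything up to the last '-'
        table=${table%.tabs}    # strip the .tabs suffix
        echo "loading $table"
        mysql -u root -p elwikt -e "
            SET autocommit=0;
            SET foreign_key_checks=0;
            SET unique_checks=0;
            SET character_set_client = utf8;
            TRUNCATE TABLE $table;
            LOAD DATA INFILE '$f' INTO TABLE $table FIELDS OPTIONALLY ENCLOSED BY '\'';
            COMMIT;"
    done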

After the import

  1. Since this is a wiktionary, I updated the LocalSettings.php file so that page titles need not start with a capital letter, adding $wgCapitalLinks = false; to the file
  2. Since this wiki has extra namespaces beyond the standard ones defined by MediaWiki, I added those to LocalSettings.php. You can find such namespaces by looking at the first few lines of the stubs XML file. Lines added: $wgExtraNamespaces[100] = 'Παράρτημα'; and $wgExtraNamespaces[101] = 'Συζήτηση_παραρτήματος';.
  3. The namespace for the project and for project discussion are typically special localized names. I added those to LocalSettings.php, finding the names in the stub XML file at the beginning: $wgMetaNamespace = 'Βικιλεξικό'; and $wgMetaNamespaceTalk = 'Συζήτηση_βικιλεξικού';
  4. I installed tidy and added the following lines to LocalSettings.php to reflect that: $wgUseTidy = true; and $wgTidyBin = '/usr/bin/tidy';. No configuration file was necessary; one is provided as part of MediaWiki and used by default.
  5. I set up the interwiki cache cdb file, by using fixup-interwikis.py via the command python fixup-interwikis.py --localsettings /var/www/html/elwikt/LocalSettings.php --sitetype wiktionary and then added $wgInterwikiCache = "$IP/cache/interwiki.cdb" to the LocalSettings.php file. (See mw:Interwiki_cache/Setup_for_your_own_wiki for info.)

That was it. This was enough to let me view (most) pages without errors.

Caveats

I didn't deal with the math settings, I forgot to do anything about svg images, I didn't set up memcache, and I didn't deal with lucene search, all things that would be useful for a real mirror. I also skipped over the Cite extension, which would be a must for Wikipedia articles. But this was still enough to let me view (most) pages without errors.

There might be more things one could turn off in MySQL before starting to shovel data in.

I didn't try to fix up Special:Statistics. I would never run the updater script for this, as it would be too slow. The underlying site_stats table should instead be populated by hand, using the count of the titles in the main namespace, the number of pages in the main namespace with links (I think this can be gotten by munging the pagelinks table), etc.
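
For the record, filling it in by hand could look something like this (a rough sketch only; it assumes the default site_stats layout for this MediaWiki version, treats "good" articles as non-redirect main-namespace pages that appear in pagelinks, and leaves the remaining columns at their defaults):

    mysql -u root -p elwikt -e "
        TRUNCATE TABLE site_stats;
        INSERT INTO site_stats (ss_row_id, ss_total_pages, ss_good_articles, ss_users, ss_total_edits, ss_images)
        SELECT 1,
            (SELECT COUNT(*) FROM page),
            (SELECT COUNT(DISTINCT p.page_id) FROM page p JOIN pagelinks pl ON pl.pl_from = p.page_id
              WHERE p.page_namespace = 0 AND p.page_is_redirect = 0),
            (SELECT COUNT(*) FROM user),
            (SELECT COUNT(*) FROM revision),
            0;"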

Because this is not a huge wiki, I didn't need to break up the tab delimited files into chunks. The recommended procedure is to have a script that writes pieces of them to a fifo from which MySQL reads.
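
The simplest version of that idea looks something like this (an untested sketch; the server needs to be able to read the fifo, which depends on file permissions and secure_file_priv, and pagelinks is just an example table):

    # Stream a large tab-delimited dump into MySQL through a fifo instead of
    # unpacking the whole file to disk first.
    mkfifo /tmp/pagelinks.fifo
    zcat elwiktionary-blahdate-pagelinks.tabs.gz > /tmp/pagelinks.fifo &
    mysql -u root -p elwikt -e \
        "LOAD DATA INFILE '/tmp/pagelinks.fifo' INTO TABLE pagelinks FIELDS OPTIONALLY ENCLOSED BY '\'';"
    wait
    rm /tmp/pagelinks.fifo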

My mysql configuration settings were laughably on the low side:

  • key_buffer_size=256M
  • max_allowed_packet=20M
  • innodb_buffer_pool_size=512M
  • innodb_additional_mem_pool_size=40M
  • innodb_log_file_size=128M
  • innodb_log_buffer_size=8M

For a large wiki these would not be at all viable.

Wikidata is going to change this procedure. I don't know what's needed for wikis that already have this enabled.

Flagged Revisions enabled on your wiki? You'll want those tables and you'll want to convert and import them. I'm sure there are some extension-specific configuration settings you'll need, and I'm equally sure I have no idea what those are.

Liquid Threads? No frickin' clue. Good luck on that one.

Import into an empty wiki of a subset of en wikipedia on Linux with MySQL

MediaWiki version: 1.21 (Branch REL1_21 from our git repository)

I chose the w:en:Wikipedia:WikiProject Cats content as the basis for this test.

I used a script designed to retrieve content specific to a given WikiProject, along with template, js and css pages. This was a proof of concept; it's possible that using Special:Export to retrieve the files and using importDump.php to import them would be faster. Anyone care to do some comparisons?
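
For comparison, the importDump.php route is just a couple of maintenance scripts (sketched below with a hypothetical export filename; the scripts live in the maintenance/ directory of the installation):

    cd /var/www/html/enwiki
    # Import the XML saved from Special:Export, then rebuild recent changes.
    php maintenance/importDump.php < WikiProject_Cats_export.xml
    php maintenance/rebuildrecentchanges.php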

MediaWiki base installation

  1. I installed the prerequisites for MediaWiki, including MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick and rsvg (see the manual).
  2. I git cloned MediaWiki and checked out the REL1_21 branch, then copied it into /var/www/html/enwiki (your location may vary).
  3. I configured the wiki on my laptop, with the following settings:
    • en for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • enwiki for database name
    • no database table prefix
    • the root db username and password as the database credentials for the install
    • a different user name and password for the database account for web access, with 'create if it does not exist' checked
    • InnoDB table format
    • Binary character set
    • Disable media uploads
    • use InstantCommons
  4. I had previously downloaded and copied the extensions I wanted into the extensions subfolder of my MediaWiki installation. I selected the extensions I wanted installed via the installer, some of them not strictly necessary, but ones I thought would be useful to have if I decided to edit locally:
    • Gadgets
    • ParserFunctions
    • Vector
    • WikiEditor

MediaWiki additional configuration

  1. I then poked around Special:Version on en.wikipedia.org and decided I should get and enable a few more extensions:
    • $wgUseAjax = true;
    • $wgCategoryTreeDynamicTag = true;
    • require_once( "$IP/extensions/CategoryTree/CategoryTree.php" );
    • require_once("$IP/extensions/Cite/Cite.php");
    • require_once("$IP/extensions/Cite/SpecialCite.php");
    • require_once("$IP/extensions/ImageMap/ImageMap.php");
    • require_once ( 'extensions/LabeledSectionTransclusion/lst.php' );
    • require_once("$IP/extensions/Poem/Poem.php");
  2. There were a few more things needed: Tidy, Scribunto, interwiki.cdb, wgContentHandlerUseDB:
    • $wgUseTidy = true;
    • $wgTidyBin = '/usr/bin/tidy';
    • require_once( "$IP/extensions/Scribunto/Scribunto.php" );
    • $wgScribuntoDefaultEngine = 'luasandbox';
    • $wgTemplateSandboxEditNamespaces[] = NS_MODULE;
  3. In order to finish the Scribunto setup, I needed to get and install the sandbox code (the luasandbox engine configured above).
  4. I installed php-mbstring, which is needed for the mb_check_encoding function used by our wrappers for Lua.
  5. For interwiki.cdb I downloaded this from [1] and copied it into /var/www/html/enwiki/cache/
  6. I set $wgContentHandlerUseDB = false; in LocalSettings.php as a workaround for the fact that the sql dumps were not going to have values for format and content model. These are fields specific to MW 1.21.

Getting the content for import

  1. I downloaded the dumps for a given day. I got all the sql.gz files from http://download.wikimedia.org/enwiki/latest even though I knew there would be a few of those sql files I wouldn't need. No content files were needed.
  2. I got the xmlfileutils from my branch in the dumps git repository: [2]
  3. I built the mwxml2sql utils, and put the python scripts, the utils and my downloaded dumps in some convenient locations.
  4. I ran the script to retrieve the content and filter the downloaded sql gz files accordingly:
    python ./wikicontent2sql.py --wikiproject 'Template:WikiProject Cats' --output catswiki --sqlfiles 'dump/enwiki-20130304-{t}.sql.gz' --wcr ./wikiretriever.py --mwxml2sql ./mwxml2sql-0.0.2/mwxml2sql --sqlfilter ./mwxml2sql-0.0.2/sqlfilter --verbose
  5. This spent a little time generating the list of titles for download (the WikiProject templates are placed on talk pages rather than on the pages you want to import). It then downloaded the content; note that downloading all of the templates took a while, but the file of templates could be reused for other WikiProject imports if desired. It created page, revision and text sql files and filtered the other sql files in preparation for import.

Doing the import

I imported all of the sql files I needed, using a little bash script which could easily be generalized for other directories and other wiki dump files.
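
It was roughly along these lines (a sketch; it assumes the filtered and generated files are the only *.sql.gz files in the output directory, and it prompts for the password once per file):

    #!/bin/bash
    # Pipe each filtered/generated sql.gz file into the enwiki database.
    DIR=./catswiki   # wherever wikicontent2sql.py wrote its output
    for f in "$DIR"/*.sql.gz; do
        echo "importing $f"
        zcat "$f" | mysql -u root -p enwiki
    done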

This was enough to let me view most pages without issues. In a few cases pages referred to media uploaded directly to en wp and not on commons, and those images did not render of course. Future work could include writing a little script to grab just those images.

Caveats

I skipped a lot of extensions that are enabled on en wp, guessing that my WikiProject wouldn't use them, such as the Math extension.

I ignored the Site statistics table and didn't try to update that manually.

I didn't bother with using LOAD DATA INFILE because this was a relatively small subset of content, even with all of the templates, which wouldn't take too long to import.

I ignored Flagged Revs completely, as well as WikiData.

See the caveats from the el wiktionary example for my mysql settings, which of course are intended for a local instance on a laptop and not for a real server taking a beating in production.


Import of enwiki with mwdumper

I have developed a script that generates bash commands with the correct filenames. Example output: https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/example-result.sh

After generation I manually delete the parts I don't need (filenames in the for loop, tables, or whole lines) and execute only the required part.


It helps to use the pigz program in place of gzip and lbzip2 in place of bzip2:

apt-get install pigz lbzip2


Patch the mwdumper code:

Make sure you have a Java JDK installed (e.g. from a PPA) and run apt-get install maven.

git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper, then patch the mwdumper code to use INSERT IGNORE INTO (faster) or REPLACE INTO (preferred); search mwdumper's code for "INSERT".

Compile it with mvn compile package.


apt-get install php5-cli

Get https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/downloadlist.php, then run php downloadlist.php > aaa followed by bash aaa.

Use the commands generated by the script as needed: mwdumper generates the revisions, so you don't need to import them with sql. The script generates a lot of commands; you don't have to use all of them, just the right ones.
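
For orientation, the generated commands boil down to pipelines of roughly this shape (a sketch only; the jar path and --format value depend on your mwdumper build and schema, and the credentials and filenames will differ):

    # Decompress with lbzip2, convert the XML dump to SQL with the patched
    # mwdumper, and pipe the result straight into MySQL.
    lbzip2 -dc enwiki-latest-pages-articles.xml.bz2 \
        | java -jar path/to/mwdumper.jar --format=sql:1.5 \
        | mysql -u root -p enwiki

    # The plain sql.gz table dumps can be decompressed in parallel with pigz.
    pigz -dc enwiki-latest-pagelinks.sql.gz | mysql -u root -p enwiki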


Please add your stories below of bigger or hairier wikis and how you imported and configured them successfully.