Data dumps/Import examples

From Meta, a Wikimedia project coordination wiki

Examples of the import process

Import into an empty wiki of el wikivoyage on Linux with MySQL

MediaWiki version: 1.29wmf19. MariaDB version for the local install: 10.1.

This wiki was chosen because it's got a non-latin1 character set in a language I read, and it's small enough that I can import all revisions without it taking forever.

Before the import: MW downloads

  1. I downloaded the dumps for a given day. I actually downloaded everything even though I knew some files would not be needed.
  2. I made sure I had all the prerequisites for MediaWiki installed: MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick.
  3. I got an up to date clone of MediaWiki from the git repo, and checked out the 1.29wmf19 branch.
  4. I also got clones of the following extensions and checked out the same branch:
    1. ActiveAbstract -- needed to run dumps, which was one of the reasons for the install
    2. Babel -- used on some user pages
    3. CategoryTree -- enabled on the wiki and used
    4. Cite -- enabled and used on the wiki
    5. CiteThisPage -- enabled on the wiki, probably could have been skipped, ah well
    6. Disambiguator -- enabled and used on the wiki
    7. Gadgets -- enabled and used on the wiki
    8. ImageMap -- enabled and used on the wiki
    9. Interwiki -- enabled but do we need it? Not sure
    10. Kartographer -- enabled and used on the wiki
    11. LabeledSectionTransclusion -- enabled and used on the wiki
    12. MwEmbedSupport -- enabled and used on the wiki
    13. PagedTiffHandler -- enabled but it's unclear to me if it's really needed
    14. ParserFunctions -- a must have for all wikis
    15. PdfHandler -- do we need it? I erred on the side of caution
    16. Scribunto -- must have for modern wikis
    17. SiteMatrix -- I wrote the dang dumps api code that gets site-related info and I still don't know if we needed this extension. Better safe than sorry.
    18. TemplateData -- enabled and used on the wiki
    19. TimedMediaHandler -- enabled and used on the wiki, though I did not test it
    20. TocTree -- enabled and used on the wiki
    21. Wikidata -- enabled and used on the wiki, BUT SEE NOTES
    22. WikiEditor -- enabled and used on the wiki
    23. WikimediaMaintenance -- I wanted to have these scripts available, just in case
  5. I grabbed the repo for the default skin (Vector) and checked out the right branch for that as well
  6. I copied the MediaWiki repo into directory elwv under my html docroot, copied the various extension repos into the elwv/extensions directory, and began the install process.
  7. I copied the skin to elwv/skins/
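The downloads above can be sketched as a short script. This is only a sketch: the branch name, the Gerrit clone URLs, and the abbreviated extension list are my assumptions, so adjust them for your wiki and MediaWiki version.

```shell
# Sketch: fetch MediaWiki core, some extensions, and the Vector skin at one branch.
# BRANCH, DOCROOT and EXTENSIONS are assumptions for this walkthrough.
BRANCH="wmf/1.29.0-wmf.19"
DOCROOT="/var/www/html/elwv"
EXTENSIONS="ActiveAbstract Babel CategoryTree Cite ParserFunctions Scribunto"

clone_at_branch() {
    # $1 = repo URL, $2 = target directory
    git clone "$1" "$2" && git -C "$2" checkout "$BRANCH"
}

# Guarded so the sketch can be sourced without actually cloning anything:
if [ -n "${RUN_CLONE:-}" ]; then
    clone_at_branch "https://gerrit.wikimedia.org/r/mediawiki/core" "$DOCROOT"
    for ext in $EXTENSIONS; do
        clone_at_branch "https://gerrit.wikimedia.org/r/mediawiki/extensions/$ext" \
            "$DOCROOT/extensions/$ext"
    done
    clone_at_branch "https://gerrit.wikimedia.org/r/mediawiki/skins/Vector" \
        "$DOCROOT/skins/Vector"
fi
```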

Before the import: MW install

  1. I installed with settings:
    • el for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • elwikivoyage for database name
    • no database table prefix
    • root db username and password for the database username/password for install
    • InnoDB table format
    • Binary character set
    • Enable media uploads (no good reason, just forgot)
    • use InstantCommons
    • no email, no cache (can be fixed later), some cc license
    • I added a wgOverrideHostname just to be on the safe side (not a fqdn, since this is a laptop)
  2. I selected all the above extensions to be installed
  3. I added some settings to LocalSettings.php:
    • $wgContentHandlerUseDB = true;
    • $wgUseTidy = true;
    • $wgTidyBin = '/usr/bin/tidy';
    • $wgIncludeLegacyJavaScript = true; (needed for some stuff in MediaWiki:Common.js as imported)
    • $wgDefaultUserOptions['toc-floated'] = true; (needed for nicely styled toc on some pages to display correctly)
  4. For debugging purposes I also added the following settings:
    • $wgShowExceptionDetails = true;
    • $wgShowDebug = true;
    • $wgDebugLogFile = '/var/www/html/elwv/logs/debuglog.txt';

I will describe what had to be done with the WikiData extension below (BUT SEE BIG WARNINGS IN BOLDFACE).

The import

  1. I created a nice working directory elsewhere
  2. I put all the downloaded files into a subdirectory "imported"
  3. I wrote some import scripts and put them in the working directory. See operations/dumps/import-tools, you want extract_tablecreate.py and import_tables.sh from the xmlfileutils directory, and you must edit the latter for the wiki, dump date, database username and password, etc.
  4. I got a git clone of the operations/dumps/import-tools master branch (see link above for info), entered the xmlfileutils directory, and ran 'make'... well, after first updating the code for MW 1.29 :-)
  5. I copied mwxml2sql, sql2txt and sqlfilter from there to my working directory in the first step
  6. I made sure wget is installed, just in case.
  7. I made sure mariadb was running (required for the MediaWiki install, in any case).
  8. I ran the import_tables.sh script.

Some notes about what this script does:

  • downloads the wiki's dump files for the specified date and wiki name, if they are not already present in the directory where the script looks for files to import
  • generates page, revision and text sql files, plus sql for the creation of those tables, if it does not find generated files in its output directory already
  • converts the above to tab-delimited files
  • converts the downloaded sql files to tab-delimited files
  • grabs the CREATE TABLE statements from the downloaded sql files and puts those into files to be used for the import (see NOTE below)
  • drops tables for import except page/text/revision
  • truncates tables page/text/revision
  • recreates dropped tables with the above saved create statements
  • imports page/text/revision data to local mariadb instance using LOAD DATA INFILE
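The page/revision/text load at the heart of the script might look roughly like this. This is a sketch, not the real import_tables.sh; the output path, dump date and database name are placeholders.

```shell
# Guarded sketch: truncate and reload page/revision/text from tab-delimited files.
# Filenames follow the wikiname-date-table pattern used elsewhere on this page.
if [ -n "${RUN_IMPORT:-}" ]; then
    for t in page revision text; do
        mysql -u root -p elwikivoyage -e \
            "TRUNCATE TABLE $t;
             LOAD DATA INFILE '$PWD/output/elwikivoyage-20170420-$t.tabs' INTO TABLE $t;"
    done
fi
```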

NOTE THAT we cannot rely on the tables as created by the MediaWiki install, because the order of fields may be different from one wiki to the next!

Example: elwikivoyage as dumped from db has:
CREATE TABLE `page_restrictions` (
  `pr_page` int(11) NOT NULL,
  `pr_type` varbinary(60) NOT NULL,
  `pr_level` varbinary(60) NOT NULL,
  `pr_cascade` tinyint(4) NOT NULL,
  `pr_user` int(11) DEFAULT NULL,
  `pr_expiry` varbinary(14) DEFAULT NULL,
  `pr_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`pr_id`),
  UNIQUE KEY `pr_pagetype` (`pr_page`,`pr_type`),
  KEY `pr_typelevel` (`pr_type`,`pr_level`),
  KEY `pr_level` (`pr_level`),
  KEY `pr_cascade` (`pr_cascade`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=binary;
The same table create statement in MediaWiki has:
CREATE TABLE /*_*/page_restrictions (
  pr_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  pr_page int NOT NULL,
  pr_type varbinary(60) NOT NULL,
  pr_level varbinary(60) NOT NULL,
  pr_cascade tinyint NOT NULL,
  pr_user int NULL,
  pr_expiry varbinary(14) NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/pr_pagetype ON /*_*/page_restrictions (pr_page,pr_type);
CREATE INDEX /*i*/pr_typelevel ON /*_*/page_restrictions (pr_type,pr_level);
CREATE INDEX /*i*/pr_level ON /*_*/page_restrictions (pr_level);
CREATE INDEX /*i*/pr_cascade ON /*_*/page_restrictions (pr_cascade);
Since we import things as tab-delimited for speed (using LOAD DATA INFILE), the order of the dumped fields must correspond to the order of the created table.

NOTE ALSO that we cannot rely on the create table statements in e.g. the page.sql file as downloaded, because those files may contain fields that the running version of MediaWiki does not have!

Example: https://phabricator.wikimedia.org/T86338 (page_counter still present)
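To recreate the tables with the dump's own field order, the CREATE TABLE statements can be pulled straight out of the downloaded .sql.gz files. A minimal sketch (extract_tablecreate.py in import-tools does this for real):

```shell
# Print the CREATE TABLE block from a downloaded table dump, so the table can
# be recreated with exactly the field order the dump was written with.
extract_create() {
    # $1 = path to a table's .sql.gz dump
    zcat "$1" | sed -n '/^CREATE TABLE/,/^).*;$/p'
}
# usage: extract_create elwikivoyage-20170420-page_restrictions.sql.gz > pr.create.sql
```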

After the import

  1. I checked that there are no extra namespaces I needed to add.
  2. To be honest, a few of the extensions I claimed to add in advance were actually added afterwards, when I saw that some pages didn't render; I then did a re-import by running the script again and all was well.
  3. The gorilla in the room: WikiData. Some pages used an image banner that relied on a WikiData property lookup. So I sighed and tried to configure all that crap. Here are the settings, stolen straight out of the WMF configs:
$wgEnableWikibaseRepo = false;
$wmgUseWikibaseRepo = false;
$wgEnableWikibaseClient = true;
$wmgUseWikibaseClient = true;

require_once( "$IP/extensions/Wikidata/Wikidata.php" );

$wgWBClientSettings['repoSiteName'] = 'wikibase-repo-name';
$baseNs = 120;
// Define the namespace indexes for repo (and client wikis also need to be aware of these,
// thus entityNamespaces need to be a shared setting).
//
// NOTE: do *not* define WB_NS_ITEM and WB_NS_ITEM_TALK when using a core namespace for items!
define( 'WB_NS_PROPERTY', $baseNs );
define( 'WB_NS_PROPERTY_TALK', $baseNs + 1 );
define( 'WB_NS_QUERY', $baseNs + 2 );
define( 'WB_NS_QUERY_TALK', $baseNs + 3 );


$wgWBSharedSettings['entityNamespaces'] = [
        'item' => NS_MAIN,
        'property' => WB_NS_PROPERTY
];

$wgWBSharedSettings['specialSiteLinkGroups'][] = 'wikidata';

$wgWBClientSettings = $wgWBSharedSettings + $wgWBClientSettings;

// to be safe, keeping this here although $wgDBname is default setting
$wgWBClientSettings['siteGlobalID'] = $wgDBname;

$wgWBClientSettings['changesDatabase'] = 'wikidatawiki';
$wgWBClientSettings['repoDatabase'] = 'wikidatawiki';

$wgWBClientSettings['repoNamespaces'] = [
        'item' => '',
        'property' => 'Property'
];

$wgWBClientSettings['languageLinkSiteGroup'] = 'wikivoyage';

$wgWBClientSettings['siteGroup'] = 'wikivoyage';
$wgWBClientSettings['otherProjectsLinksByDefault'] = true;

$wgWBClientSettings['excludeNamespaces'] = function() {
        // @fixme 102 is LiquidThread comments on wikinews and elsewhere?
        // but is the Extension: namespace on mediawiki.org, so we need
        // to allow wiki-specific settings here.
        return array_merge(
                MWNamespace::getTalkNamespaces(),
                // 90 => LiquidThread threads
                // 92 => LiquidThread summary
                // 118 => Draft
                // 1198 => NS_TRANSLATE
                // 2600 => Flow topic
                [ NS_USER, NS_FILE, NS_MEDIAWIKI, 90, 92, 118, 1198, 2600 ]
        );
};

$wmgWikibaseClientSettings = [];

foreach( $wmgWikibaseClientSettings as $setting => $value ) {
        $wgWBClientSettings[$setting] = $value;
}

$wgWBClientSettings['allowDataTransclusion'] = true;
$wgWBClientSettings['allowDataAccessInUserLanguage'] = false;
$wgWBClientSettings['entityAccessLimit'] = 400;

// fixme find out what the hell this is
$wgWBClientSettings['sharedCacheKeyPrefix'] = 'wadafuq';
$wgWBClientSettings['sharedCacheDuration'] = 60 * 60 * 24;

$wgWBClientSettings['repoUrl'] = 'https://www.wikidata.org';
$wgWBClientSettings['repoConceptBaseUri'] = 'http://www.wikidata.org/entity/';
$wgArticlePlaceholderImageProperty = 'P18';

$wgWBClientSettings['badgeClassNames'] = [
        'Q17437796' => 'badge-featuredarticle',
        'Q17437798' => 'badge-goodarticle',
        'Q17559452' => 'badge-recommendedarticle', // T72268
        'Q17506997' => 'badge-featuredlist', // T72332
        'Q17580674' => 'badge-featuredportal', // T75193
        'Q20748091' => 'badge-notproofread', // T97014 - Wikisource badges
        'Q20748094' => 'badge-problematic',
        'Q20748092' => 'badge-proofread',
        'Q20748093' => 'badge-validated',
        'Q28064618' => 'badge-digitaldocument', // T153186
];

// Overwrite or add commons links in the "other projects sidebar" with the "commons category" (P373), per T126960
$wgWikimediaBadgesCommonsCategoryProperty = $wgDBname === 'commonswiki' ? null : 'P373';

$wgArticlePlaceholderSearchEngineIndexed = false;
$wgWBClientSettings['propertyOrderUrl'] = 'https://www.wikidata.org/w/index.php?title=MediaWiki:Wikibase-SortedProperties&action=raw&sp_ver=1';

This allowed the wiki to at least parse property tags in the templates that use them and come back with nothing; those templates have a fallback, and that allowed pages to render OK.

HOWEVER. A big huge fat HOWEVER. Wikidata DOES NOT have an InstantCommons-like mode where you can use the Wikidata repo of the WMF from your local wiki. See here: [1] (remind me to replace this link with a better one later). The open ticket for adding such functionality is inactive [2]. THIS MEANS that in order to use properties and entities, we have to figure out a way to import them into a local Wikibase repo, which is not trivial. Currently Wikidata has over 25 million items (entities, properties, etc.). Clearly we don't want to import all of those. Compare the English language Wikipedia: it has fewer than 6 million articles. I need to investigate all of this or see if others have done some work in this area.

After all of this most things looked like they rendered OK, but there were a couple of tags I saw and didn't fix up. I was able to run dumps successfully on this install, and that's what I really wanted it for.

Import into an empty wiki of el wiktionary on Linux with MySQL

MediaWiki version: 1.20

This wiki was chosen because it uses a non-latin1 character set, has a reasonable number of articles but isn't huge, and relies on only a small number of extensions.

I chose to import only the current pages, without User or Talk pages, because most folks who set up local mirrors want the article content and not the revision history or the discussion pages.

Before the import

  1. I downloaded the dumps for a given day. I got all the sql.gz files, the stub-articles.xml.gz file, and the pages-articles.xml.bz2 file from http://download.wikimedia.org/elwiktionary/ even though I knew there would be a few of those sql files I wouldn't need.
  2. I installed the prerequisites for MediaWiki, including MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick and rsvg (see the manual).
  3. I downloaded MediaWiki 1.20 and unpacked it into /var/www/html/elwikt (your location may vary).
  4. I installed MediaWiki 1.20 on my laptop, with the following settings:
    • el for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • elwikt for database name
    • no database table prefix
    • root db username and password for the database username and password for install
    • a different user name and password for the database account for web access, with 'create if it does not exist' checked
    • InnoDB table format
    • Binary character set
    • Disable media uploads
    • use InstantCommons
  5. I selected the extensions I wanted installed via the installer, some of them not being necessary but I thought they would be useful to have if I did decide to locally edit:
    • ConfirmEdit
    • Gadgets
    • Nuke
    • ParserFunctions
    • RenameUser
    • Vector
    • WikiEditor
  6. I generated page, revision and text sql files from the stub and page content XML files, using mwxml2sql via the command mwxml2sql -s elwiktionary-blahblah-stub-articles.xml.gz -t elwiktionary-blahblah-pages-articles.xml.bz2 -f elwikt-pages-current-sql.gz -m 1.20
  7. I converted all the sql files to tab delimited files using sql2txt (same repo as previous step) via the command zcat elwiktionary-blahdate-blahtable.sql.gz | sql2txt | gzip > elwiktionary-blahdate-blahtable.tabs.gz. Actually that's a lie, I wrote a tiny bash script to do them all for me. I skipped the following downloaded files:
    • site_stats - I didn't want or need these, the numbers would be wrong anyways
    • user_groups - Not needed for displaying page content
    • old_image and image - using InstantCommons
    • page - generated from XML files instead
  8. I converted the page, revision and text table files that were generated from the XML files, to tab delimited, using a command similar to the above step
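The "tiny bash script" mentioned in step 7 might look something like this sketch; the skip list matches the files listed above, and sql2txt is the tool from the import-tools repo.

```shell
# Tables whose downloaded sql files we do not convert (see the list above).
SKIP="site_stats user_groups old_image image page"

skip_table() {
    # succeed (0) if table $1 is in the skip list above
    case " $SKIP " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}

# Guarded: convert every remaining downloaded sql.gz to tab-delimited.
if [ -n "${RUN_CONVERT:-}" ]; then
    for f in elwiktionary-*-*.sql.gz; do
        table=$(basename "$f" .sql.gz | sed 's/^elwiktionary-[0-9]*-//')
        skip_table "$table" && continue
        zcat "$f" | ./sql2txt | gzip > "${f%.sql.gz}.tabs.gz"
    done
fi
```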

The actual import

Note: maybe using charset 'binary' here would be better!

  1. I imported all of the above files into MySQL, doing the following:
    • mysql -u root -p
    • mysql>use elwikt
    • mysql>SET autocommit=0;
    • mysql>SET foreign_key_checks=0;
    • mysql>SET unique_checks=0;
    • mysql>SET character_set_client = utf8;
    • unpacked the tab delimited file
    • mysql>TRUNCATE TABLE tablenamehere;
    • mysql>LOAD DATA INFILE 'path-to-tab-delim-file-for-table-here' INTO TABLE tablenamehere FIELDS OPTIONALLY ENCLOSED BY '\'';
    • repeated this for all tab delim files
    • mysql>exit;
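The same session can be replayed non-interactively per table; this sketch prints the statements so they can be piped into mysql. Note it also adds a COMMIT at the end, which is needed once autocommit is off (the table and file names are placeholders).

```shell
import_sql() {
    # $1 = table name, $2 = path to the unpacked tab-delimited file;
    # prints the statements the session above ran by hand
    cat <<EOF
SET autocommit=0;
SET foreign_key_checks=0;
SET unique_checks=0;
SET character_set_client = utf8;
TRUNCATE TABLE $1;
LOAD DATA INFILE '$2' INTO TABLE $1 FIELDS OPTIONALLY ENCLOSED BY '\\'';
COMMIT;
EOF
}
# usage: import_sql pagelinks /tmp/elwikt-pagelinks.tabs | mysql -u root -p elwikt
```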

After the import

  1. Since this is a wiktionary, I updated the LocalSettings.php file so that page titles need not start with a capital letter, adding $wgCapitalLinks = false; to the file
  2. Since this wiki has extra namespaces beyond the standard ones defined by MediaWiki, I added those to LocalSettings.php. You can find such namespaces by looking at the first few lines of the stubs XML file. Lines added: $wgExtraNamespaces[100] = 'Παράρτημα'; and $wgExtraNamespaces[101] = 'Συζήτηση_παραρτήματος';.
  3. The namespace for the project and for project discussion are typically special localized names. I added those to LocalSettings.php, finding the names in the stub XML file at the beginning: $wgMetaNamespace = 'Βικιλεξικό'; and $wgMetaNamespaceTalk = 'Συζήτηση_βικιλεξικού';
  4. I installed tidy and added the following lines to LocalSettings.php to reflect that: $wgUseTidy = true; and $wgTidyBin = '/usr/bin/tidy';. No configuration file was necessary; one is provided as part of MediaWiki and used by default.
  5. I set up the interwiki cache cdb file, by using fixup-interwikis.py via the command python fixup-interwikis.py --localsettings /var/www/html/elwikt/LocalSettings.php --sitetype wiktionary and then added $wgInterwikiCache = "$IP/cache/interwiki.cdb" to the LocalSettings.php file. (See mw:Interwiki_cache/Setup_for_your_own_wiki for info.)

That was it. This was enough to let me view (most) pages without errors.

Caveats

I didn't deal with the math settings, I forgot to do anything about SVG images, I didn't set up memcached, and I didn't deal with Lucene search, all things that would be useful for a real mirror. I also skipped over the Cite extension, which would be a must for Wikipedia articles. But this was still enough to let me view (most) pages without errors.

There might be more things one could turn off in MySQL before starting to shovel data in.

I didn't try to fix up Special:Statistics. I would never run the updater script for this, as it would be too slow. This table should be populated by hand using the count of the titles in the main namespace, the number of pages in the main namespace with links (I think this can be gotten by munging the pagelinks table), etc.
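A hedged sketch of what populating it by hand could look like. The column names are from the site_stats schema of that era, and counting distinct pagelinks sources in the main namespace as "good articles" is just my reading of the note above, so verify both against your schema before using this.

```shell
site_stats_sql() {
    # Prints SQL that rebuilds site_stats from the imported tables.
    # ss_good_articles here is an approximation, not MediaWiki's exact definition.
    cat <<'EOF'
TRUNCATE TABLE site_stats;
INSERT INTO site_stats (ss_row_id, ss_total_pages, ss_good_articles, ss_total_edits, ss_users)
SELECT 1,
       (SELECT COUNT(*) FROM page),
       (SELECT COUNT(DISTINCT pl_from) FROM pagelinks, page
         WHERE page_id = pl_from AND page_namespace = 0),
       (SELECT COUNT(*) FROM revision),
       (SELECT COUNT(*) FROM user);
EOF
}
# usage: site_stats_sql | mysql -u root -p elwikt
```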

Because this is not a huge wiki, I didn't need to break up the tab delimited files into chunks. The recommended procedure is to have a script that writes pieces of them to a fifo from which MySQL reads.
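The fifo procedure could be sketched like this (guarded; whether your MySQL permits LOAD DATA INFILE on a fifo depends on version and configuration, so treat this as an outline only):

```shell
if [ -n "${RUN_FIFO:-}" ]; then
    mkfifo /tmp/loadpipe
    # writer: unpack and feed the huge tab-delimited file into the fifo
    zcat elwiktionary-blahdate-pagelinks.tabs.gz > /tmp/loadpipe &
    # reader: MySQL consumes the fifo as if it were a file
    mysql -u root -p elwikt -e \
        "LOAD DATA INFILE '/tmp/loadpipe' INTO TABLE pagelinks;"
    wait
    rm -f /tmp/loadpipe
fi
```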

My mysql configuration settings were laughably on the low side:

  • key_buffer_size=256M
  • max_allowed_packet=20M
  • innodb_buffer_pool_size=512M
  • innodb_additional_mem_pool_size=40M
  • innodb_log_file_size=128M
  • innodb_log_buffer_size=8M

For a large wiki these would not be at all viable.

Wikidata is going to change this procedure. I don't know what's needed for wikis that already have this enabled.

Flagged Revisions enabled on your wiki? You'll want those tables and you'll want to convert and import them. I'm sure there are some extension-specific configuration settings you'll need, and I'm equally sure I have no idea what those are.

Liquid Threads? No frickin' clue. Good luck on that one.

Import into an empty wiki of a subset of en wikipedia on Linux with MySQL

MediaWiki version: 1.21 (Branch REL1_21 from our git repository)

I chose the w:en:Wikipedia:WikiProject Cats content as the basis for this test.

I used a script designed to retrieve content specific to a given WikiProject, along with template, js and css pages. This was a proof of concept; it's possible that using Special:Export to retrieve the files and using importDump.php to import them would be faster. Anyone care to do some comparisons?

MediaWiki base installation

  1. I installed the prerequisites for MediaWiki, including MySQL, PHP 5, Apache, php-mysql, php-intl, ImageMagick and rsvg (see the manual).
  2. I git cloned MediaWiki and checked out the REL1_21 branch, then copied it into /var/www/html/enwiki (your location may vary).
  3. I configured the wiki on my laptop, with the following settings:
    • en for my language and the wiki language
    • MySQL database type
    • localhost for hostname (hey, it's a local install on my laptop :-P)
    • enwiki for database name
    • no database table prefix
    • root db username and password for the database username and password for install
    • a different user name and password for the database account for web access, with 'create if it does not exist' checked
    • InnoDB table format
    • Binary character set
    • Disable media uploads
    • use InstantCommons
  4. I had previously downloaded and copied the extensions I wanted into the extension subfolder of my MediaWiki installation. I selected the extensions I wanted installed via the installer, some of them not being necessary but I thought they would be useful to have if I did decide to locally edit:
    • Gadgets
    • ParserFunctions
    • Vector
    • WikiEditor

MediaWiki additional configuration

  1. I then poked around Special:Version on en.wikipedia.org and decided I should get and enable a few more extensions:
    • $wgUseAjax = true;
    • $wgCategoryTreeDynamicTag = true;
    • require_once( "$IP/extensions/CategoryTree/CategoryTree.php" );
    • require_once("$IP/extensions/Cite/Cite.php");
    • require_once("$IP/extensions/Cite/SpecialCite.php");
    • require_once("$IP/extensions/ImageMap/ImageMap.php");
    • require_once ( 'extensions/LabeledSectionTransclusion/lst.php' );
    • require_once("$IP/extensions/Poem/Poem.php");
  2. There were a few more things needed: Tidy, Scribunto, interwiki.cdb, wgContentHandlerUseDB:
    • $wgUseTidy = true;
    • $wgTidyBin = '/usr/bin/tidy';
    • require_once( "$IP/extensions/Scribunto/Scribunto.php" );
    • $wgScribuntoDefaultEngine = 'luasandbox';
    • $wgTemplateSandboxEditNamespaces[] = NS_MODULE;
  3. In order to finish the Scribunto setup I needed to get and install the sandbox code (the luasandbox PHP extension).
  4. I installed php-mbstring, needed for mb_check_encoding used by our wrappers for Lua
  5. For interwiki.cdb I downloaded this from [3] and copied it into /var/www/html/enwiki/cache/
  6. I set $wgContentHandlerUseDB = false; in LocalSettings.php as a workaround for the fact that the sql dumps were not going to have values for format and content model. These are fields specific to MW 1.21.

Getting the content for import

  1. I downloaded the dumps for a given day. I got all the sql.gz files from http://download.wikimedia.org/enwiki/latest even though I knew there would be a few of those sql files I wouldn't need. No content files were needed.
  2. I got the xmlfileutils from my branch in git for the dumps: [4]
  3. I built the mwxml2sql utils, and put the python scripts, the utils and my downloaded dumps in some convenient locations.
  4. I ran the script to retrieve the content and filter the downloaded sql gz files accordingly:
    python ./wikicontent2sql.py --wikiproject 'Template:WikiProject Cats' --output catswiki --sqlfiles 'dump/enwiki-20130304-{t}.sql.gz' --wcr ./wikiretriever.py --mwxml2sql ./mwxml2sql-0.0.2/mwxml2sql --sqlfilter ./mwxml2sql-0.0.2/sqlfilter --verbose
  5. This spent a little time generating the list of titles for download (the WikiProject templates are placed on talk pages rather than on the pages you want to import). It then downloaded the content; note that downloading all of the templates took a while, but the file of templates could be reused for other wikiproject imports if desired. It created page, revision and text sql files and filtered the other sql files in preparation for import.

Doing the import

I imported all of the sql files we needed, using a little bash script which could easily be generalized for other directories and other wiki dump files.
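That little bash script was along these lines (the directory, database name and credentials here are placeholders):

```shell
DB="enwiki"
DUMPDIR="catswiki"

# Guarded: feed each filtered/generated sql.gz straight into MySQL.
if [ -n "${RUN_SQLIMPORT:-}" ]; then
    for f in "$DUMPDIR"/*.sql.gz; do
        echo "importing $f into $DB"
        zcat "$f" | mysql -u root -p"$DBPASS" "$DB"
    done
fi
```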

This was enough to let me view most pages without issues. In a few cases pages referred to media uploaded directly to en wp and not on commons, and those images did not render of course. Future work could include writing a little script to grab just those images.

Caveats

I skipped a lot of extensions that are enabled on en wp, guessing that my WikiProject wouldn't use them, such as the Math extension.

I ignored the Site statistics table and didn't try to update that manually.

I didn't bother with using LOAD DATA INFILE because this was a relatively small subset of content, even with all of the templates, which wouldn't take too long to import.

I ignored Flagged Revs completely, as well as WikiData.

See the caveats from the el wiktionary example for my mysql settings, which of course are intended for a local instance on a laptop and not for a real server taking a beating in production.

Import of enwiki

I have developed a script that generates bash commands with the correct filenames; example output: https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/example-result.sh

After generation I manually delete the parts I don't need (filenames in the for loop, tables, or individual lines) and execute only the required part.

It helps to use the pigz program for gzip and lbzip2 for bzip2, so you may want to install these:

  • apt-get install pigz lbzip2

Patch the mwdumper code:

  1. make sure you have a Java JDK installed (search the web for: install java ppa)
  2. apt-get install maven
  3. git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper
  4. patch the mwdumper code to do INSERT IGNORE INTO (faster) or REPLACE INTO (preferred); search in mwdumper's code for "INSERT"
  5. compile it using: mvn compile package

apt-get install php5-cli

Grab the download script generator from here: https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/downloadlist.php and run

  1. php downloadlist.php > aaa
  2. bash aaa

Use some (not all) of the commands generated by the script; have a look at it to see what you need.

mwdumper generates the revisions, so you don't need to import them from the sql dumps.

This code generates a lot of commands; you don't have to use all of them, just the right ones.
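Putting the pieces together, the mwdumper step might be run like this sketch (the jar name under target/ depends on your build, so check it; lbzip2 does the parallel decompression as suggested above):

```shell
# Guarded sketch: stream the page-content XML through the patched mwdumper
# into MySQL. The jar filename is an assumption; check your mvn build output.
if [ -n "${RUN_MWDUMPER:-}" ]; then
    lbzip2 -dc enwiki-latest-pages-articles.xml.bz2 \
      | java -jar mwdumper/target/mwdumper-1.25.jar --format=sql:1.25 \
      | mysql -u root -p"$DBPASS" enwiki
fi
```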


Please add your stories below of bigger or hairier wikis and how you imported and configured them successfully.