Data dumps/Import issues

Issues encountered after import

After import, the articles are messed up; I see random HTML tags, or templates do not display properly. What can I do?

Install and configure tidy.

The Wikimedia Foundation's websites use the tidy program, which corrects improperly formatted HTML produced by the MediaWiki software. Many of these problems are not the fault of the MediaWiki software package or its supporting logic; they are caused by poorly designed templates and other errors that users have introduced into the content itself. Since it is almost impossible to correct these elements in the XML dumps, a program that cleans up the HTML output rendered by MediaWiki solves many of these problems.

Make certain you have tidy installed on your system, then add the following settings to the LocalSettings.php file for your MediaWiki installation:

$wgUseTidy = true;
$wgTidyBin = '/usr/bin/tidy';

From MediaWiki 1.10 on, a working configuration for tidy is provided in the includes/ directory of MediaWiki, and the default path uses this file. See mw:Manual:$wgTidyConf for more on this.
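
To confirm that tidy is actually present at the path configured above, a quick shell check (the /usr/bin/tidy path is the common default; yours may differ):

which tidy     # should print the path you set in $wgTidyBin, e.g. /usr/bin/tidy
tidy -v        # prints the installed tidy version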

My images are not rendering properly or do not show up at all in the article. How do I fix it?

This assumes you are not using InstantCommons or a Foreign File Repo (requesting media from Commons or another external web site on the fly).

First, you must run rebuildImages.php if you have copied images from elsewhere into the images/ directory under your MediaWiki installation. This is required for MediaWiki to create database entries for the images and resync them with articles. There are two methods of invoking this script:

Rebuild the entire image table for MediaWiki:

php maintenance/rebuildImages.php >& images.log &

Rebuild only the missing entries in the image table:

php maintenance/rebuildImages.php --missing >& images.log &

You should run the script in both modes after copying images into the images/ directory to sync up the database.
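
As a rough sketch, both passes can be run back to back into a single log that you can then scan for problems (this assumes you run from the MediaWiki root):

php maintenance/rebuildImages.php > images.log 2>&1            # full rebuild pass
php maintenance/rebuildImages.php --missing >> images.log 2>&1 # missing-entries pass
grep -i error images.log                                       # scan the log afterwards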

Please note that if you have set $wgMimeDetectorCommand = "file -bi"; in your LocalSettings.php file, this command will report "very short file" errors when it encounters corrupted images, causing rebuildImages.php to halt. If you encounter this error, check your images.log file, delete the file in question, and restart the rebuildImages.php script. The file -bi command does not have a silent mode, and MediaWiki will halt on errors where it cannot determine the MIME type of a particular image file.
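
To hunt for corrupted files ahead of time, something along these lines may help (the 100-byte threshold and the sample file path are assumptions; adjust them for your wiki):

find images/ -type f -size -100c -print   # list suspiciously small files
file -bi images/a/ab/Example.png          # check what the MIME detector reports for one suspect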

At present, rsvg-based SVG rendering in MediaWiki 1.9.3 fails on several Linux distributions for some of the SVG files referenced by the Wikipedia dumps, and there is no known fix for these problems. SVG+XML image files appear to render correctly in most cases.
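
To check whether an rsvg binary is present at all (the binary name varies across distributions and versions; both names below are possibilities, not guarantees):

which rsvg || which rsvg-convert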

How do I get the system statistics to show up properly after I import the XML files?

Run the initStats.php maintenance script after your import completes. This will clear out the stats and update them with correct information. Warning: this may take a very long time to complete:

php maintenance/initStats.php --update

There are several options to initStats.php depending on which MediaWiki version you are running. You can open initStats.php in a text editor and check the bottom of the file for usage information.
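
On newer MediaWiki versions, maintenance scripts can also print their usage directly; older versions may not support this flag:

php maintenance/initStats.php --help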

My wiki takes a long time to display articles. What can I do to get better performance?

  • Install eAccelerator or another PHP accelerator to compile the PHP scripts in real time.
  • Set up memcached.
  • Enable MediaWiki's caching so generated HTML files are cached on disk and clients accessing your site will not always execute the PHP MediaWiki code to recreate what would essentially be static articles (see the example below).

MediaWiki will use a local directory on your server to create a caching system very similar to that provided by the squid web proxy cache: articles are cached after they have been rendered into HTML, and clients read those HTML files directly rather than always invoking PHP code to recreate the article from wikitext (the source format the articles are written in). To enable MediaWiki caching of HTML, create a directory, set its permissions to 777 (rwxrwxrwx), then point MediaWiki to this directory and enable the caching. You may also want to delay SQL updates of articles during editing and updates from remote users to increase system performance. A sample entry in the LocalSettings.php file which enables this capability would be:

$wgUseFileCache = true;
$wgFileCacheDirectory = "$IP/cache";   // point to the directory used for caching
$wgAntiLockFlags = ALF_NO_LINK_LOCK | ALF_NO_BLOCK_LOCK;  // delay writes to improve parallelism

The local caching capability of MediaWiki does not check how much disk space you have available. Make certain you have plenty of disk space if you enable this: the rendered HTML files for several million English articles can consume a very large amount of storage.
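
A minimal shell sketch for setting up the cache directory described above (the /var/www/wiki path is an assumed installation root; substitute the directory matching your $IP):

mkdir -p /var/www/wiki/cache    # matches $wgFileCacheDirectory above
chmod 777 /var/www/wiki/cache   # rwxrwxrwx, as described above
du -sh /var/www/wiki/cache      # check its size periodically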

I get these strange math tags all over the articles with text formatting strings where mathematical formulas should be. What have I done wrong?

Install texvc and set up directories for it to use.

MediaWiki uses the texvc program to render math expressions by dynamically creating PNG images for them. Make certain you have downloaded the texvc packages, or downloaded OCaml so you can rebuild the program. The program is located in the math/ directory off your main MediaWiki root directory. Also, make certain you have created a tmp/ directory off the MediaWiki root and set its permissions to 777 (rwxrwxrwx) to give the texvc program workspace in which to render the images from articles.
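
A minimal build-and-setup sketch, assuming the old-style MediaWiki layout in which texvc is compiled with make inside the math/ directory (run from the MediaWiki root; the OCaml toolchain must be installed):

cd math && make && cd ..   # compile texvc with the OCaml toolchain
mkdir -p tmp               # workspace directory off the MediaWiki root
chmod 777 tmp              # rwxrwxrwx, as described above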

I get these strange tags like expr, ref, and cite when my articles are displayed and they are filled with unsightly looking text. How do I fix this?

Install and configure ParserFunctions, Cite, and ImageMap MediaWiki extensions.

You need to include a minimal set of parser extensions to parse citation and reference tags. Most of the templates on Wikipedia behave a lot like actual code, complete with processing instructions. Two extensions that are essential to render 99% of the Wikipedia articles are Cite.php and ParserFunctions.php; Wikipedia relies heavily on them. They can be downloaded from Meta or the MediaWiki site. Place them in your extensions directory and add the following lines to the end of your MediaWiki LocalSettings.php file:

require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/ImageMap/ImageMap.php" );

There are at present two different ImageMap extensions, one for MediaWiki 1.5 and another for MediaWiki 1.9. This example refers to the MediaWiki 1.9 version, which is required with recent Wikipedia dumps. If you include the ImageMap.php extension, you may need to enable the php-dom module. On Fedora Core 5 and Fedora Core 6, use the yum utility to install it:

yum install php-dom

You may need to upgrade both your MySQL database and your PHP modules to 5.1.6 in order to properly support php-dom and the ImageMap.php extension. If you have been using eAccelerator or another PHP compiler or accelerator, you will need to update that module as well. On FC5, the command to update it is:

yum install php-eaccelerator
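
After installing, you can confirm that PHP actually sees the dom module (restart your web server afterwards so the running PHP picks it up):

php -m | grep -i dom       # should list "dom" if the module is active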

I have imported the XML dump and enabled all of the extensions, but I keep getting red links to the other language versions, and they do not show up correctly in the left sidebar. How do I fix this?

For MediaWiki earlier than 1.19, make sure you have imported the interwiki.sql file from the dump; for MediaWiki 1.19 or later, set up the interwiki.cdb file.

For MediaWiki versions before 1.19, you have to update the MySQL database with the interwiki.sql file from the project dump. You must have already installed MediaWiki successfully OR have created the database for the wiki manually in order to apply the SQL script. Example command:

zcat path-to-file/wikiname-date-interwiki.sql.gz | mysql -u username-here -p 

Note that if you installed MediaWiki using a database table prefix for your tables, you'll need to edit this file to update the table name accordingly. The file is small enough to edit with a plain text editor, but if you prefer you can use sed to update the name instead. For example, if your database table prefix were 'mw_', you could give the command:

zcat path-to-file/wikiname-date-interwiki.sql.gz | \
sed -re 's/^(DROP TABLE IF EXISTS|INSERT INTO|CREATE TABLE|\/\*!40000 ALTER TABLE) `interwiki`/\1 `mw_interwiki`/g;' | \
mysql -u username-here -p
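
To verify that the import worked, a quick count of the interwiki table can help (the database name wikidb is an illustrative assumption, and the mw_ prefix applies only if you use one; substitute your own values):

mysql -u username-here -p -e 'SELECT COUNT(*) FROM mw_interwiki;' wikidb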

For versions of MediaWiki from 1.19 on, you instead need to set up the interwiki.cdb file. See the information on the interwiki cache for details.

I imported the dumps from the English Wikipedia into my Spanish-language MediaWiki installation so I can use them to translate into Spanish, but now Recent Changes and other navigation links point to non-existent pages. What did I do wrong?

By default, most of the enwiki dumps from the English Wikipedia (the same applies to just about any language-specific XML dump provided by the Foundation) contain MediaWiki namespace entries which define things such as the navigation toolbar, site messages, and other MediaWiki-specific configuration settings. These entries are included in the XML dumps by language so that someone importing the dump into a properly configured target MediaWiki installation can in essence totally replicate the Wikipedia site.

Many of the MediaWiki namespace entries are very specific to the exact language used on a particular Wikipedia project, and they can cause problems if the XML dump is imported into a MediaWiki installation configured for a language different from the one the dump originated from.

It is possible to import XML dumps compiled from one language into a MediaWiki site configured for an entirely different language; however, it is recommended that you first use mwdumper to convert the XML file into a new target dump with the MediaWiki-specific namespace entries stripped out. This can be accomplished by invoking mwdumper with the following syntax. In the example below, enwiki-<date>.xml is the source XML file, and enwiki-no_mediawiki.xml is the output XML dump with the MediaWiki namespace entries removed:

 java -jar mwdumper.jar --output=file:enwiki-no_mediawiki.xml --format=xml --filter=namespace:\!NS_MEDIAWIKI \
      enwiki-<date>.xml

If you use this approach, you may wish to grab the MediaWiki namespace entries that end in .js or .css and import them separately, as they may contain JavaScript and/or stylesheet entries that the wikitext on the imported wiki relies on.
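
One possible way to pull out just those pages is mwdumper's list filter, assuming you first create a titles.txt file containing one page title per line (for example MediaWiki:Common.css and MediaWiki:Common.js; the file name and its contents here are illustrative):

 java -jar mwdumper.jar --output=file:enwiki-sitejs_css.xml --format=xml --filter=list:titles.txt \
      enwiki-<date>.xml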