
Data dumps/ImportDump.php


importDump.php


MediaWiki 1.5 and above includes a command-line script, importDump.php, which can be used to import an XML page dump into the database. MediaWiki must be installed and configured before you can use it. It is also very slow.

As an example invocation, when you have an XML file called temp.xml:

 php maintenance/importDump.php < maintenance/temp.xml

As an example invocation, when you have an XML file called temp.xml, writing an output log and running in the background:

 php maintenance/importDump.php < maintenance/temp.xml >& progress.log &

On Unix-based systems, you can display the log entries in real time on the console by using the tail command (you may kill the tail process and re-invoke the command later to check your progress periodically without disturbing the import):

 tail -f progress.log

Common problems and solutions


I get NULL title errors and importDump.php crashes with an exception. What can I do to get it working?

One way to work around these problems is to add a "+" character to $wgLegalTitleChars (this setting does not exist by default prior to 1.8.0; see mw:Manual:$wgLegalTitleChars) in the LocalSettings.php file located in the root MediaWiki directory, to deal with NULL titles in articles and with articles which require URL expansion:

$wgLegalTitleChars = " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+"; // add the "+" character at the end like this

Note: the above is identical to the default value in the current version (checked: 1.16 alpha), so no change is necessary in newer versions.

Also, if you are seeing problems with importDump.php, it is helpful to get proper output from MediaWiki about where the error is occurring. Add this setting to LocalSettings.php and restart importDump.php after the error to get more helpful error information for reporting or analyzing the problem:

$wgShowExceptionDetails = true;

importDump.php is very slow and will render images and TeX formulas while the articles are importing. On large dumps, such as the English Wikipedia (enwiki), it can take over a week to update an active wiki. As of MediaWiki 1.9.3 it still has severe problems with the published dumps provided by the Wikimedia Foundation. In those cases where importDump.php does not work, it is advised to write your own utilities, with lex or a C-based program, to strip out articles with NULL titles from the dumps; a rough sketch of such a filter is shown below.
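
As a rough illustration, here is a minimal, unofficial sketch of such a filter written in PHP. The script name strip-null-titles.php is hypothetical, and the sketch assumes that each <page> element opens and closes on its own line, as the published dumps do; it copies a dump from standard input to standard output and drops any page whose <title> is empty:

 <?php
 // strip-null-titles.php -- hypothetical helper, not part of MediaWiki.
 // Copies an XML dump from stdin to stdout, dropping <page> blocks with empty titles.
 $inPage   = false;   // currently inside a <page> ... </page> block
 $buffer   = array(); // lines of the current page block
 $hasTitle = false;   // the current page has a non-empty <title>
 $in = fopen( 'php://stdin', 'r' );
 while ( ( $line = fgets( $in ) ) !== false ) {
     if ( !$inPage ) {
         if ( strpos( $line, '<page>' ) !== false ) {
             $inPage   = true;
             $buffer   = array( $line );
             $hasTitle = false;
         } else {
             echo $line; // siteinfo header, closing </mediawiki>, etc.
         }
         continue;
     }
     $buffer[] = $line;
     if ( preg_match( '!<title>\s*(.*?)\s*</title>!', $line, $m ) && $m[1] !== '' ) {
         $hasTitle = true;
     }
     if ( strpos( $line, '</page>' ) !== false ) {
         if ( $hasTitle ) {
             echo implode( '', $buffer ); // keep pages with a usable title
         }
         $inPage = false;
         $buffer = array();
     }
 }
 fclose( $in );

It would be invoked, for example, as:

 php strip-null-titles.php < temp.xml > temp-clean.xml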

I am still getting NULL title crashes with importDump.php after adding the '+' character. Are there any other solutions to this problem?

The articles contained in the XML dumps, and some of the templates used to create the Wikipedia: namespace entries, are particularly troublesome in some of the dumps. Many titles reference deleted articles from the Wikipedia: namespace which are the residue of spamming and other attacks intended to deface Wikipedia. The PHP-based XML parser libraries on most Linux systems have built-in checks that reject garbage text strings with inconsistent XML tags. The best solution is to enable debugging and attempt to determine which specific article is causing the problem. Enable the debug log and run importDump.php again; it should record the last article successfully imported in the debug.log file.

 $wgDebugLogFile = "debug.log";
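
After re-running the import, look at the end of the debug log to see the last article that was processed before the crash, for example:

 tail -n 50 debug.log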

Fixing null article titles and GFDL site compliance


OK, great. I tracked down the article causing the NULL title crashes in importDump.php. Now what do I do to get the XML dump file corrected?

You can use the gfdl-wikititle utility to strip out or correct damaged titles in Wikipedia XML dumps. This program also allows you to insert links into your dump which point back to the English Wikipedia, or to whatever URL you wish to link each article to. The GFDL license requires that you attribute authorship of the articles contained in the dumps. The preferred method recommended by the Wikimedia Foundation and the Wikipedia community is to post the text of the GFDL license on your MediaWiki site, then link each article back to the parent article on Wikipedia. This is most easily accomplished by inserting an interwiki language link into each article. gfdl-wikititle inserts interwiki links into the XML dump and can also strip out and fix bad titles which cause the importDump.php program to crash with NULL title errors. Please refer to the meta page which describes gfdl-wikititle.
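
For example, a mirrored copy of a hypothetical article called Example article could end its wikitext with an interlanguage link back to the parent article (this assumes the en: prefix is configured as an interlanguage link on your wiki):

 [[en:Example article]]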

The source code for this program, for Windows and Linux, is posted on Meta and can be downloaded and modified if you need to enhance or alter its output of the XML dumps.

I am running importDump.php and it is really slow, and I am seeing a large number of what appear to be error messages in the logs, such as "TeX" running and "radicaleye.com dvips". What can I do?

Some tips for increasing the performance of importDump.php when attempting to import enwiki dumps are:

  • set $wgUseTeX = false; in LocalSettings.php before running importDump.php. This will delay TeX rendering until someone actually tries to read an article from the web. Be certain to set this variable back to true after running importDump.php to re-enable TeX rendering for formulas and other content.
  • do not run maintenance/rebuildImages.php until after the dump has been imported on a new MediaWiki installation if you have downloaded the image files. importDump.php will attempt to render the images if articles contain templates that insert thumbnail images, using ImageMagick or convert, and this will slow down importing by several orders of magnitude.
  • use a system with a minimum of 8GB of physical memory and at least 4GB of configured swap space to run the enwiki dumps.
  • in the MySQL /etc/my.cnf file, make certain max_allowed_packet is set above 20M if possible. This variable limits the combined size of an SQL request plus its data during import operations. If it is set too low, mysqld will abort any database connections that send requests larger than this value and halt article importing from importDump.php.
  • tune the MySQL my.cnf settings to maximize performance. A sample is provided below:
 /etc/my.cnf
 [mysqld]
 datadir=/var/lib/mysql
 socket=/var/lib/mysql/mysql.sock
 # Default to using old password format for compatibility with mysql 3.x
 # clients (those using the mysqlclient10 compatibility package).
 old_passwords=1
 set-variable = key_buffer_size=2G
 # during XML import, article size is directly related to max_allowed_packet
 # (1G is the maximum for MySQL 4.0 and above, 20M for 3.x)
 set-variable = max_allowed_packet=20M
 set-variable = table_cache=256
 set-variable = max_connections=500
 innodb_data_file_path = ibdata1:10M:autoextend
 # Set buffer pool size to 50-80% of your computer's memory
 innodb_buffer_pool_size=2G
 innodb_additional_mem_pool_size=40M
 #
 # Set the log file size to about 25% of the buffer pool size
 innodb_log_file_size=250M
 innodb_log_buffer_size=8M
 #
 innodb_flush_log_at_trx_commit=1
 [mysql.server]
 user=mysql
 basedir=/var/lib
 [mysqld_safe]
 log-error=/var/log/mysqld.log
 pid-file=/var/run/mysqld/mysqld.pid
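
After editing my.cnf, restart mysqld so the new values take effect, then confirm that the running server picked up the packet limit. The commands below assume the Red Hat style init script layout implied by the paths above; the service name and path may differ on your distribution:

 /etc/init.d/mysqld restart
 mysql -u root -p -e "SHOW VARIABLES LIKE 'max_allowed_packet';"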