Data dumps/Download tools

From Meta, a Wikimedia project coordination wiki

Download tools

Downloading the XML dumps

Once you've decided what files to download, it's important to pick the right server, preferably one of the mirrors. Mirrors may be much closer to you and are usually less overloaded than dumps.wikimedia.org, which also enforces strict connection and speed limits.

For the download you can use any download manager, but you may prefer a standard command-line downloader like wget or curl, which handles URL selection, resuming, retrying, and so on. For instance, to download the latest full dump of a wiki (Meta-Wiki in this example) from the source server, in 7z format to save on size and decompression time:

wget --recursive --no-parent --no-directories --continue --accept 7z https://dumps.wikimedia.org/metawiki/latest/

or in short:

wget -r -np -nd -c -A 7z https://dumps.wikimedia.org/metawiki/latest/

If this doesn't use the full speed of your machine and network, and you're sure you can't switch to a mirror, try the axel download accelerator (man axel) to use more connections:

axel --num-connections=3 https://dumps.wikimedia.org/metawiki/latest/metawiki-latest-pages-meta-current.xml.bz2

If you need to download several files over multiple connections, you can drive wget (or curl) with xargs.
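One way to do this, sketched below as a dry run: put one URL per line in a file and let xargs start several wget processes in parallel. The filenames here are illustrative; substitute the dump files you actually want.

```shell
# urls.txt holds one dump URL per line (illustrative filenames).
cat > urls.txt <<'EOF'
https://dumps.wikimedia.org/metawiki/latest/metawiki-latest-stub-meta-current.xml.gz
https://dumps.wikimedia.org/metawiki/latest/metawiki-latest-page.sql.gz
EOF

# -n 1: one URL per wget invocation; -P 4: up to four downloads in parallel.
# "echo" makes this a dry run that only prints the commands;
# remove it to perform the actual downloads.
xargs -n 1 -P 4 echo wget -c < urls.txt
```

The `-c` flag lets each wget resume a partially downloaded file if you rerun the command.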

If you need to download a lot of dumps, scripts such as WikiTeam's wikipediadownloader.py are available.
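For a handful of wikis, a plain shell loop over the URL pattern shown above may be enough. A minimal sketch, again as a dry run (the wiki database names are examples):

```shell
# Fetch the latest pages-meta-current dump for several wikis.
# "echo" makes this a dry run; drop it to actually download.
for wiki in metawiki simplewiki; do
    echo wget -c "https://dumps.wikimedia.org/$wiki/latest/$wiki-latest-pages-meta-current.xml.bz2"
done
```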

Downloading media

You can download media bundles for a project or use rsync to pick up media from one of our mirror sites.

Alternatively, you can use the Wikix program to read any XML dump and create a series of parallel download scripts that run on a Linux-based system. Wikix requires that curl be installed on your Linux distribution.

The WikiTeam software provides similar capabilities as well.

Downloading XML dumps and access logs

The open-source package QUAC includes the scripts wp-get-dumps and wp-get-access, which use rsync to download dumps and access logs from mirrors.