Data dumps/2006 notes

From Meta, a Wikimedia project coordination wiki


Clusters

Dumps for the wikis hosted in our Korean cluster will be served from a separate host, http://download-yaseo.wikimedia.org/

Reporting

The backup runner script will generate some pretty HTML pages showing status as each file completes, so it should be easier to see what's done, what's in progress, and what failed.

I'm about to code up this part, shouldn't be too hard I hope. :)
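As a rough illustration of the idea, here is a minimal sketch of a status page renderer. Everything in it is an assumption made for the example: the function name render_status_page, the three status strings, and the HTML structure are hypothetical and are not taken from the actual backup runner script.

```python
import html

# Hypothetical status values; the real script's states are not specified in these notes.
STATUS_LABELS = {"done": "done", "in-progress": "in progress", "failed": "failed"}


def render_status_page(dbname, date, items):
    """Return an HTML page listing each dump file with its status.

    items is a list of (filename, status) pairs.
    """
    rows = []
    for filename, status in items:
        # Unrecognized statuses are shown as "unknown" rather than raising.
        label = STATUS_LABELS.get(status, "unknown")
        rows.append(
            '<li class="%s">%s: %s</li>'
            % (html.escape(status, quote=True), html.escape(filename), html.escape(label))
        )
    title = html.escape("%s dump progress on %s" % (dbname, date))
    return (
        "<html><head><title>%s</title></head><body>\n"
        "<h1>%s</h1>\n<ul>\n%s\n</ul>\n</body></html>\n"
        % (title, title, "\n".join(rows))
    )
```

Re-rendering the page each time a file completes would keep the done, in-progress, and failed items visible at a glance; the status is also emitted as a CSS class so a stylesheet could colour the rows.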

File layout

The script already generates files in this basic layout:

  • public/
    • dbname/
      • YYYYMMDD/
        • dbname-YYYYMMDD-all-titles-in-ns0.gz
          list of page names for BBC
        • dbname-YYYYMMDD-table.gz
          SQL table dumps
        • dbname-YYYYMMDD-pages-type.xml.bz2
        • dbname-YYYYMMDD-pages-type.xml.7z
          XML page text dumps
        • dbname-YYYYMMDD-abstract.xml.gz
          page extracts for Yahoo
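The naming scheme above can be sketched as a pair of small helpers. This is only an illustration: the function names and the "public" default root are assumptions, and the suffix values passed in the usage example stand in for the "table" and "type" placeholders in the listing.

```python
import posixpath


def dump_filename(dbname, date, suffix):
    """Build a file name of the form dbname-YYYYMMDD-suffix."""
    return "%s-%s-%s" % (dbname, date, suffix)


def dump_path(dbname, date, suffix, root="public"):
    """Build the full path root/dbname/YYYYMMDD/dbname-YYYYMMDD-suffix."""
    return posixpath.join(root, dbname, date, dump_filename(dbname, date, suffix))


# Usage, with a hypothetical database name and date:
print(dump_path("enwiki", "20060125", "all-titles-in-ns0.gz"))
print(dump_path("enwiki", "20060125", "abstract.xml.gz"))
```

Keeping the database name and date in both the directory and the file name means a downloaded file stays identifiable after it has been moved out of its directory.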

Static URLs

There will probably also be a directory of symbolic links, giving each file a static URL that always points to its latest version. It will likely look like this:

  • public/
    • dbname/
      • latest/
        • dbname-all-titles-in-ns0.gz
          list of page names for BBC
        • dbname-table.gz
          SQL table dumps
        • dbname-pages-type.xml.bz2
        • dbname-pages-type.xml.7z
          XML page text dumps
        • dbname-abstract.xml.gz
          page extracts for Yahoo
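One way the latest/ links could be maintained is sketched below, assuming a POSIX filesystem. The helper names latest_name and link_latest are hypothetical, and the use of relative link targets plus a rename to swap links in place is a design choice of this example, not something the notes specify.

```python
import os


def latest_name(filename, dbname, date):
    """Strip the date from dbname-YYYYMMDD-rest, giving dbname-rest."""
    prefix = "%s-%s-" % (dbname, date)
    if not filename.startswith(prefix):
        raise ValueError("unexpected dump file name: %r" % filename)
    return "%s-%s" % (dbname, filename[len(prefix):])


def link_latest(public_root, dbname, date):
    """Point public_root/dbname/latest/* at the files of the given dated run."""
    dated_dir = os.path.join(public_root, dbname, date)
    latest_dir = os.path.join(public_root, dbname, "latest")
    os.makedirs(latest_dir, exist_ok=True)
    prefix = "%s-%s-" % (dbname, date)
    created = []
    for filename in sorted(os.listdir(dated_dir)):
        if not filename.startswith(prefix):
            continue  # skip anything that is not a dump file of this run
        link = os.path.join(latest_dir, latest_name(filename, dbname, date))
        # Relative target, so the tree can be moved or mirrored as a whole.
        target = os.path.join("..", date, filename)
        tmp = link + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)
        # Renaming over the old link replaces it in one step on POSIX systems.
        os.replace(tmp, link)
        created.append(link)
    return created
```

Calling link_latest only after a run has finished successfully would keep the static URLs from pointing at partial files.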

Images/uploads

Not yet included; this may change in the near future.