Talk:Data dumps/Dump format

From Meta, a Wikimedia project coordination wiki

It took me some time to understand what is contained in the data dumps. A better description of each file is needed. I have started adding details for some of the SQL files (the ones I know).

Content files

I've started to figure out how the content files are organized. The full dump is in a single (very large) file:

enwiki-YYYYMMDD-pages-articles.xml.bz2

This is also broken down into smaller chunks. For the 2021-04-01 dump, for example, there are 60 files in total, with names matching the pattern (using Python 3.7 re syntax):

^enwiki-20210401-pages-articles(?P<part>[0-9]+).xml-p(?P<start>[0-9]+)p(?P<end>[0-9]+).bz2$

I assume the <part> field is an artifact of a two-pass chunking process. It appears you can safely ignore it and just sort on <start> if you need to process the files in order. Note that <part>, <start>, and <end> are not padded with leading zeros, so sorting the file names alphabetically doesn't work. The <start> and <end> fields appear to be (1-based) byte offsets from the beginning of a data stream, since the <start> of one file is equal to the <end> + 1 of the previous file. RoySmith (talk) 18:07, 8 April 2021 (UTC)
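To illustrate the sorting point, here is a minimal Python sketch that parses the chunk names with a regex like the one above and sorts on the numeric <start> field. The filenames below are made up for illustration (only the p1p873 range appears in the discussion); they are chosen so that alphabetical and numeric order actually differ.

```python
import re

# Illustrative chunk filenames; numbers are unpadded, as described above.
names = [
    "enwiki-20210401-pages-articles10.xml-p5001p20000.bz2",
    "enwiki-20210401-pages-articles1.xml-p1p873.bz2",
    "enwiki-20210401-pages-articles2.xml-p874p5000.bz2",
]

pattern = re.compile(
    r"^enwiki-20210401-pages-articles"
    r"(?P<part>[0-9]+)\.xml-p(?P<start>[0-9]+)p(?P<end>[0-9]+)\.bz2$"
)

def start_key(name):
    """Sort key: the numeric <start> field, not the raw string."""
    return int(pattern.match(name).group("start"))

# Alphabetical order would put articles10 before articles2;
# sorting on the numeric <start> gives the intended 1, 2, 10 order.
for name in sorted(names, key=start_key):
    print(name)
```

The same `int(...)` trick applies to <part> and <end> if you need them; the key point is just that string comparison on unpadded numbers misorders the files.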

Alternatively, enwiki-latest-pages... instead of the specific YYYYMMDD. RoySmith (talk) 03:33, 13 April 2021 (UTC)

Update: It turns out that the p-numbers are the page_ids from the page table. So, p1p873 means this fragment contains information about pages 1-873. RoySmith (talk) 23:03, 18 April 2021 (UTC)
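Given that reading of the p-numbers, here is a hedged sketch of locating the chunk that covers a particular page_id. The chunk names and ranges below (other than p1p873) are illustrative, not taken from a real dump listing:

```python
import re

# Illustrative chunk filenames; each carries a [start, end] page-id range.
chunks = [
    "enwiki-20210401-pages-articles1.xml-p1p873.bz2",
    "enwiki-20210401-pages-articles2.xml-p874p5000.bz2",
    "enwiki-20210401-pages-articles3.xml-p5001p20000.bz2",
]

p_range = re.compile(r"-p(?P<start>[0-9]+)p(?P<end>[0-9]+)\.bz2$")

def chunk_for_page(page_id, names):
    """Return the chunk whose page-id range covers page_id, or None."""
    for name in names:
        m = p_range.search(name)
        if m and int(m.group("start")) <= page_id <= int(m.group("end")):
            return name
    return None

print(chunk_for_page(873, chunks))
# -> enwiki-20210401-pages-articles1.xml-p1p873.bz2
```

This assumes the ranges are inclusive on both ends, which matches the "pages 1-873" reading above; note that page_ids can have gaps (deleted pages), so a chunk need not contain every id in its range.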