Data dumps/Misc dumps format
The format of XML/SQL dumps is documented here. The Wikidata entity dump formats are documented here for JSON and here for RDF.
The format of the other dumps produced by Wikimedia is described below.
Category dumps
These are produced in RDF format. For each category, the following information is provided:
- Category title
- Whether the category is hidden
- Number of pages in the category, excluding subcategories and files
- Number of subcategories
- URLs of the categories to which this category belongs
<https://en.wikipedia.org/wiki/Category:0-6-0PT_locomotives> a mediawiki:Category ;
    rdfs:label "0-6-0PT locomotives" ;
    mediawiki:pages "13"^^xsd:integer ;
    mediawiki:subcategories "0"^^xsd:integer .
...
<https://en.wikipedia.org/wiki/Category:0-6-0PT_locomotives> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:0-6-0T_locomotives>,
    <https://en.wikipedia.org/wiki/Category:0-6-0_locomotives>,
    <https://en.wikipedia.org/wiki/Category:Commons_category_link_is_on_Wikidata>,
    <https://en.wikipedia.org/wiki/Category:Tank_locomotives>,
    <https://en.wikipedia.org/wiki/Category:Whyte_notation>
Cirrus search index dumps
For more information, see the extension documentation on the search schema.
These files contain importable Cirrus search indexes in JSON format. For each entry, the following is provided:
type | one of 'page' or 'namespace' |
---|---|
For namespaces:
id | namespace number |
---|---|
<wikiname> | name of specific wiki |
<wikitype> | name of wiki type (wikipedia, wikivoyage, and so on) |
For pages:
auxiliary_text | thumbnail captions, tables and a few other things that are searchable but not part of the primary page content |
---|---|
category | list of categories to which this page belongs |
content_model | whether the page content is wikitext, JSON and so on |
coordinates | geographical coordinates provided via the parser function '#coordinates', if present |
create_timestamp | date and time page was first created |
defaultsort | the sort key for sorting the page in categories which contain it, if set |
display_title | the value for the DISPLAYTITLE magic word, if set |
external_link | list of links made in this page that lead outside of the wiki projects |
heading | list of entries in the page content surrounded by == (so HTML h2 headers) |
incoming_links | number of pages that link to this page |
language | content language of the page |
namespace | number of the namespace of this page |
namespace_text | name of the namespace of this page |
opening_text | text before the first heading (h2 through h6, i.e. == through ======) |
outgoing_link | list of links made in this page that lead to other pages on wiki projects |
redirect | namespace and title of pages which redirect to this page, if any |
source_text | raw text of the revision |
text | everything but the opening text and auxiliary text (after wikitext expansion) |
text_bytes | length of revision content, in bytes |
template | list of templates included by this page |
timestamp | timestamp of current revision |
title | title of page |
version | current revision id |
version_type | always 'external' when set |
wiki | name of specific wiki |
wikibase_item | the Q number of the page on Wikidata |
{"index":{"_type":"namespace","_id":"4"}}
{"name":["wikibooks"],"wiki":"simplewikibooks"}
...
{"index":{"_type":"page","_id":"29012876"}}
{"template":[],"redirect":[],"wikibase_item":"Q8801599","heading":[],"source_text":"[[Category:Protected areas of Maine by county|Penobscot]]\n[[Category:Geography of Penobscot County, Maine]]\n[[Category:Tourist attractions in Penobscot County, Maine]]","version_type":"external","opening_text":null,"wiki":"enwiki","coordinates":[],"auxiliary_text":[],"language":"en","title":"Protected areas of Penobscot County, Maine","version":755291624,"external_link":[],"namespace_text":"Category","namespace":14,"text_bytes":167,"incoming_links":1,"text":"","category":["Protected areas of Maine by county","Geography of Penobscot County, Maine","Tourist attractions in Penobscot County, Maine"],"defaultsort":false,"outgoing_link":[],"timestamp":"2016-12-17T05:52:40Z","content_model":"wikitext","create_timestamp":"2010-10-01T04:20:42Z"}
{"index":{"_type":"page","_id":"57554132"}}
{"version":843749848,"wiki":"enwiki","namespace":14,"namespace_text":"Category","title":"Turkish aerobic gymnasts","timestamp":"2018-05-31T06:08:09Z","category":["Turkish gymnasts","Aerobic gymnasts"],"external_link":[],"outgoing_link":["Portal:Gymnastics"],"template":["Template:Portal","Module:Portal","Module:Portal\/images\/g"],"text":"Gymnastics portal","source_text":"{{Portal|Gymnastics}}\n\n[[Category:Turkish gymnasts|Aerobic]]\n[[Category:Aerobic gymnasts]]","text_bytes":90,"content_model":"wikitext","coordinates":[],"language":"en","heading":[],"opening_text":null,"auxiliary_text":[],"defaultsort":false,"redirect":[],"incoming_links":1,"create_timestamp":"2018-05-31T06:08:09Z","wikibase_item":"Q55963883"}
{"index":{"_type":"page","_id":"9772184"}}
{"template":[],"redirect":[],"heading":[],"source_text":"hello","version_type":"external","opening_text":null,"wiki":"enwiki","coordinates":[],"auxiliary_text":[],"language":"en","title":"Copywrong~enwiki","version":657578431,"external_link":[],"namespace_text":"User","namespace":2,"text_bytes":5,"incoming_links":0,"text":"hello","category":[],"defaultsort":false,"outgoing_link":[],"timestamp":"2015-04-21T15:02:03Z","content_model":"wikitext","create_timestamp":"2007-02-28T15:46:39Z"}
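The excerpt above alternates between an index line ({"index": ...}) and the entry it describes, a layout that resembles the Elasticsearch bulk format. The following Python sketch pairs the two, assuming the real files hold one JSON object per line. The function name iter_cirrus_entries is hypothetical, and the last sample entry is abbreviated from the excerpt.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_cirrus_entries(lines: Iterable[str]) -> Iterator[Tuple[dict, dict]]:
    """Yield (index metadata, entry) pairs from alternating JSON lines."""
    meta = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if meta is None:
            # Index line: {"index": {"_type": ..., "_id": ...}}
            meta = obj["index"]
        else:
            # Entry line belonging to the index line just read
            yield meta, obj
            meta = None


# Sample lines taken from the excerpt; the page entry is abbreviated.
sample = [
    '{"index":{"_type":"namespace","_id":"4"}}',
    '{"name":["wikibooks"],"wiki":"simplewikibooks"}',
    '{"index":{"_type":"page","_id":"9772184"}}',
    '{"title":"Copywrong~enwiki","namespace":2,"text_bytes":5,"wiki":"enwiki"}',
]

entries = list(iter_cirrus_entries(sample))
for meta, doc in entries:
    if meta["_type"] == "page":
        print(meta["_id"], doc["title"], doc["text_bytes"])
```

If the dump file is gzip-compressed, passing the file object returned by gzip.open(path, "rt", encoding="utf-8") instead of the sample list should stream it without loading the whole file into memory.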
Content translation dumps
The content translation dumps are provided in three formats: JSON with HTML, JSON with text, and TMX with text. 'Text' in this context means that any HTML markup has been stripped out; see the file excerpts below for an example.
For each entry the following are included, with field names varying according to format:
- the language of the source text (the text to be translated)
- the target language
- the source text itself
- the machine translation of the source text, and the machine translation engine used
- the target text (the human translation)
For more information, see the extension documentation on published translations.
{ "id": "629016/17", "sourceLanguage": "ba", "targetLanguage": "el", "source": { "content": "<section rel=\"cx:Section\" id=\"cxSourceSection17\" data-mw-cx-source=\"undefined\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Һылтанмалар</span></h2>\n</section>" }, "mt": { "engine": "Yandex", "content": "<section rel=\"cx:Section\" id=\"cxTargetSection17\" data-mw-cx-source=\"Yandex\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Σύνδεσμος</span></h2></section>" }, "target": { "content": "<section rel=\"cx:Section\" id=\"cxTargetSection17\" data-mw-cx-source=\"Yandex\"><h2 id=\"4a7502fbf361633e09fc38c09d6c5b\"><span data-segmentid=\"96\" class=\"cx-segment\">Εξωτερικοί σύνδεσμοι</span></h2></section>" } }, ... { "id": "629016/8", "sourceLanguage": "ba", "targetLanguage": "el", "source": { "content": "<section rel=\"cx:Section\" id=\"cxSourceSection8\" data-mw-cx-source=\"undefined\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Башҡортостандың атҡаҙанған мәҙәниәт хеҙмәткәре (1993).</span></li></ul>\n\n</section>" }, "mt": { "engine": "Yandex", "content": "<section rel=\"cx:Section\" id=\"cxTargetSection8\" data-mw-cx-source=\"Yandex\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Τιμήθηκε ο εργαζόμενος πολιτισμού Χαλκίδα (1993).</span></li></ul></section>" }, "target": { "content": "<section rel=\"cx:Section\" id=\"cxTargetSection8\" data-mw-cx-source=\"Yandex\"><ul id=\"mwGg\"><li id=\"mwGw\"><span data-segmentid=\"79\" class=\"cx-segment\">Τιμημένος εργαζόμενος του πολιτισμού, Δημοκρατία του Μπασκορτοστάν (1993).</span></li></ul></section>" } },
{ "id": "501270/mwCA", "sourceLanguage": "ar", "targetLanguage": "el", "source": { "content": "المعتمديّة هي تقسيم إداري يستخدم في تونس. ويمثل المستوى الثاني للتقسيم الإداري بالجمهورية التونسية، حيث ترجع المعتمدية بالنظر إلى الولاية كما تنقسم إلى بلديات (مدن) ثم إلى عمادات (مناطق) ثم إلى مجالس قروية[1]." }, "mt": { "engine": "Yandex", "content": "Σατέν κλωστές είναι η διοικητική διαίρεση που χρησιμοποιούνται στην Τυνησία. Και το δεύτερο επίπεδο, η διοικητική διαίρεση της Δημοκρατίας της Τυνησίας, όπου το σατέν κλωστές για τα μέλη είναι επίσης χωρίζεται σε δήμους (πόλεις) και, στη συνέχεια, να την κοσμητεία της (την περίπτωση) και, στη συνέχεια, να τα συμβούλια χωριό[1]." }, "target": { "content": "Η μουταμιντίγια είναι όρος διοικητικής διαίρεσης που χρησιμοποιούνται στην Τυνησία. Ανήκει στο δεύτερο επίπεδο της διοικητικής διαίρεσης της Δημοκρατίας της Τυνησίας, όπου τα μουταμιντίγια των κυβερνείων επίσης χωρίζονται σε δήμους (πόλεις) και, στη συνέχεια, σε ιμαντάτ και στη συνέχεια σε συμβούλια χωριών[1]." } },
<tu srclang="ar"> <tuv xml:lang="ar"> <prop type="origin">source</prop> <seg>مراجع</seg> </tuv> <tuv xml:lang="el"> <prop type="origin">mt</prop> <seg>Αναφορές</seg> </tuv> <tuv xml:lang="el"> <prop type="origin">user</prop> <seg>Παραπομπές</seg> </tuv> </tu> <tu srclang="ar"> <tuv xml:lang="ar"> <prop type="origin">source</prop> <seg>معتمدية صيادة، ولاية المنستير</seg> </tuv> <tuv xml:lang="el"> <prop type="origin">mt</prop> <seg>Γραφείο Επιτρόπου κυνηγός, η εντολή του Monastir</seg> </tuv> <tuv xml:lang="el"> <prop type="origin">user</prop> <seg>Έδρα του μουταμιντίγια, στο κυβερνείο Μοναστίρ</seg> </tuv> </tu>
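In the TMX excerpt above, each tu element carries three tuv variants, distinguished by the 'origin' property (source, mt, or user). The following Python sketch reads them with the standard library. The sample is one unit from the excerpt, wrapped in a body element so that it parses on its own; that wrapper and the function name read_units are assumptions, not part of the dump specification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# One translation unit from the excerpt, wrapped in <body> (an assumption)
# so that the fragment is a well-formed document.
sample = """<body>
<tu srclang="ar">
  <tuv xml:lang="ar"><prop type="origin">source</prop><seg>مراجع</seg></tuv>
  <tuv xml:lang="el"><prop type="origin">mt</prop><seg>Αναφορές</seg></tuv>
  <tuv xml:lang="el"><prop type="origin">user</prop><seg>Παραπομπές</seg></tuv>
</tu>
</body>"""


def read_units(xml_text):
    """Return one dict per <tu>, mapping origin -> (language, segment text)."""
    units = []
    for tu in ET.fromstring(xml_text).iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            origin = tuv.findtext("prop[@type='origin']")
            unit[origin] = (tuv.get(XML_LANG), tuv.findtext("seg"))
        units.append(unit)
    return units


units = read_units(sample)
for unit in units:
    # Compare the machine translation with the human translation.
    print(unit["mt"][1], "->", unit["user"][1])
```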
Image info dumps
These files come in pairs. The -local- file contains the name and upload date and time of each file uploaded locally to the wiki. The -remote- file contains a list of the files uploaded to Commons that are used on the local wiki; this information is retrieved from the MediaWiki globalimagelinks table.
The first line of each file lists the field name(s); the -local- file lists img_name and img_timestamp while the -remote- file lists gil_to.
Timestamps are in YYYYMMDDHHMMSS format. File names are written as they are stored in the database; for example, spaces appear as underscores.
Sample excerpt from orwiki-20190519-local-wikiqueries.gz:
Berhampur-university_logo.png 20120222071753
MKCG_Medical_college_logo_1.svg 20120222111540
SCB_Medical_college_logo.svg 20120224093008
VSS_medical_college_logo.svg 20120227075907
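The following Python sketch parses entries like those in the excerpt above and converts the YYYYMMDDHHMMSS timestamps to datetime objects. It assumes that the fields are separated by whitespace (a tab in the sample below) and that the first line is the header described earlier. The function name parse_local is hypothetical.

```python
from datetime import datetime

# Header line plus two entries from the excerpt; the tab separator is an assumption.
sample = """img_name\timg_timestamp
Berhampur-university_logo.png\t20120222071753
MKCG_Medical_college_logo_1.svg\t20120222111540
"""


def parse_local(text):
    """Return (file name, upload datetime) pairs, skipping the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        # File names contain underscores rather than spaces, so the last
        # whitespace-separated field is always the timestamp.
        name, stamp = line.rsplit(None, 1)
        rows.append((name, datetime.strptime(stamp, "%Y%m%d%H%M%S")))
    return rows


rows = parse_local(sample)
for name, uploaded in rows:
    print(name.replace("_", " "), uploaded.isoformat())
```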
Media and article title dumps
Each of these files has 'page_title' as its first line, followed by a list of page titles in alphabetical order. The media titles dump lists the titles of all pages in the File: namespace (6), and the page titles dump lists the titles of all pages in the main (0) namespace. Titles are dumped as they are found in the database, so spaces have been converted to underscores.
Sample excerpt (from media titles dump):
page_title
!!!_(Chk_Chk_Chk)_-_One_Girl_One_Boy_cover_art.jpg
!!!_-_!!!_album_cover.jpg
!!e!VBQQ!mM_$(KGrHqEOKi8E03iU,-u!BNP3+G6Mqw_1.jpg
!0_Trombones_Like_2_Pianos.jpg
!Haunu.ogg
!Hero_(album).jpg
Short url dumps
These files contain a list of entries in the following format: short-url|full-url
where the short URL https://w.wiki/short-url-here redirects to the full URL, which links to a page on one of our wiki projects.
Sample excerpt:
L|https://en.wikipedia.org/wiki/LGBT
M|https://en.wikipedia.org/wiki/MediaWiki
N|https://en.wikipedia.org/wiki/NetHack
P|https://en.wikipedia.org/wiki/Jean-Luc_Picard
Q|https://www.wikidata.org/wiki/Help:Items
R|https://en.wikipedia.org/wiki/Dennis_Ritchie
S|https://sv.wikipedia.org/wiki/Stockholm
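The following Python sketch turns entries like those in the excerpt above into a mapping from complete short URLs to full URLs. It assumes one entry per line. The function name parse_short_urls and the constant SHORT_BASE are hypothetical.

```python
SHORT_BASE = "https://w.wiki/"


def parse_short_urls(lines):
    """Map each complete short URL to the full URL it redirects to."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first "|" only, in case a full URL itself contains one.
        code, full_url = line.split("|", 1)
        mapping[SHORT_BASE + code] = full_url
    return mapping


# Two entries from the excerpt.
sample = [
    "L|https://en.wikipedia.org/wiki/LGBT",
    "Q|https://www.wikidata.org/wiki/Help:Items",
]

mapping = parse_short_urls(sample)
print(mapping["https://w.wiki/Q"])
```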